7    Creating Locales

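This chapter explains how to develop a locale, which provides information appropriate for a particular combination of language, territory, and codeset. You use the localedef command to create locales from the source files described in this chapter: a character map (charmap) source file (Section 7.1) and a locale definition source file (Section 7.2). Section 7.3 introduces the methods that convert between multibyte and wide-character encodings.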

7.1    Creating a Character Map Source File for a Locale

A character map (charmap) source file defines symbolic names for the binary encodings of characters. The localedef command uses this file to map character symbols in a locale source file to the character encodings. Example 7-1 shows a fragment of the ISO8859-1.cmap source file that is used in the fr_FR.ISO8859-1@example locale being developed in this chapter. Section E.1 contains the ISO8859-1.cmap file in its entirety.

Example 7-1:  The charmap File for a Sample Locale

#      [1]
#     Charmap for ISO 8859-1 codeset     [1]
#      [1]
 
<code_set_name>                 "ISO8859-1"     [2]
<mb_cur_max>                    1      [2]
<mb_cur_min>                    1      [2]
<escape_char>                   \      [2]
<comment_char>                  #      [2]
 
CHARMAP      [3]
 
#  Portable characters and other standard     [1]
#  control characters                         [1]
 
<NUL>                           \x00      [4]
<SOH>                           \x01
<STX>                           \x02
<ETX>                           \x03
<EOT>                           \x04
<ENQ>                           \x05
<ACK>                           \x06
<BEL>                           \x07
<alert>                         \x07
<backspace>                     \x08
<tab>                           \x09
<newline>                       \x0a
<vertical-tab>                  \x0b
<form-feed>                     \x0c
<carriage-return>               \x0d
<SO>                            \x0e

.
.
.
<zero>                          \x30      [4]
<one>                           \x31
<two>                           \x32
<three>                         \x33
<A>                             \x41
<B>                             \x42
<C>                             \x43
<D>                             \x44
.
.
.
<underscore>                    \x5f      [4]
<low-line>                      \x5f
<grave-accent>                  \x60
<a>                             \x61
<b>                             \x62
<c>                             \x63
<d>                             \x64
.
.
.
#  Extended control characters     [1]
#  (names taken from ISO 6429)     [1]

<PAD>                           \x80      [4]
<HOP>                           \x81
<BPH>                           \x82
<NBH>                           \x83
<IND>                           \x84
.
.
.
#  Other graphic characters     [1]

<nobreakspace>                  \xa0      [4]
<inverted-exclamation-mark>     \xa1
.
.
.
END CHARMAP      [5]

  1. Comment line

    By default, the comment character is the number sign (#). You can override this default with a <comment_char> definition (see 2). [Return to example]

  2. Keyword declarations

    This example provides entries for all valid declarations and specifies default values for all but <code_set_name>. Usually, you specify a declaration only when you want to override its default value. In this example, the declarations for <escape_char> and <comment_char> specify the default values for the escape character and comment character, respectively. The value for <mb_cur_max>, the maximum length (in bytes) of a character, is 1 for this particular charmap file. The value for <mb_cur_min>, the minimum length (in bytes) of a character, must be 1 in charmap files for all locales. (All locales include characters in the Portable Character Set, which defines single-byte characters.)

    The <code_set_name> value is the value returned by the nl_langinfo(CODESET) call made by applications that bind to the locale at run time. [Return to example]

  3. Header marking start of character maps [Return to example]

  4. Symbol-to-coding maps for characters

    Each character map consists of a symbolic name and encoding. The name and encoding are separated by one or more spaces.

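    A symbolic name begins with the left angle bracket (<) and ends with the right angle bracket (>). The characters between the angle brackets can be any characters from the Portable Character Set, except for control and space characters. If the name includes more than one right angle bracket (>), all but the last one must be preceded by the value of <escape_char>. A symbolic name cannot exceed 128 bytes in length.

    An encoding can be one or more decimal, octal, or hexadecimal constants. (Multiple constants apply to multibyte encodings.) Each constant begins with the escape character (by default, the backslash) and has one of the following formats: a decimal constant is \d followed by two or three decimal digits (for example, \d065); an octal constant is the escape character followed by two or three octal digits (for example, \101); a hexadecimal constant is \x followed by two hexadecimal digits (for example, \x41).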

    You can define multiple character map entries (each with a different symbolic name) for the same encoding value. In this example, <BEL> and <alert> both map to \x07, and <underscore> and <low-line> both map to \x5f. [Return to example]

  5. Trailer marking end of character maps [Return to example]
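
The structure of Example 7-1 can be sketched in a few lines of Python. The following is an illustrative reader only, not the localedef implementation: the function names parse_encoding and parse_charmap are hypothetical, the escape character is assumed to be the default backslash, and the constant formats are assumed to be \xHH (hexadecimal), \dDDD (decimal), and \OOO (octal). Declarations before the CHARMAP header are skipped.

```python
import re

# One encoding constant: \xHH (hex), \dDD or \dDDD (decimal), \OO or \OOO (octal).
CONSTANT = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|d([0-9]{2,3})|([0-7]{2,3}))")

def parse_encoding(text):
    # Convert one or more constants (one per byte) into a bytes object.
    out = bytearray()
    pos = 0
    while pos < len(text):
        m = CONSTANT.match(text, pos)
        if m is None:
            raise ValueError("bad encoding constant: " + text[pos:])
        hex_digits, dec_digits, oct_digits = m.groups()
        if hex_digits is not None:
            out.append(int(hex_digits, 16))
        elif dec_digits is not None:
            out.append(int(dec_digits, 10))
        else:
            out.append(int(oct_digits, 8))
        pos = m.end()
    return bytes(out)

def parse_charmap(lines, comment_char="#"):
    # Return a dict that maps each symbolic name to its encoding.
    symbols = {}
    in_map = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith(comment_char):
            continue
        if line == "CHARMAP":
            in_map = True
        elif line == "END CHARMAP":
            break
        elif in_map:
            name, encoding = line.split()[:2]
            symbols[name] = parse_encoding(encoding)
    return symbols

# Hypothetical fragment; <low-line> uses an octal constant for illustration.
SAMPLE = r"""
<code_set_name> "ISO8859-1"
CHARMAP
# two names for one encoding
<BEL>        \x07
<alert>      \x07
<A>          \x41
<underscore> \x5f
<low-line>   \137
END CHARMAP
""".splitlines()

charmap = parse_charmap(SAMPLE)
print(charmap["<alert>"] == charmap["<BEL>"])
```

Because the dictionary is keyed by symbolic name, two names that share an encoding (such as <BEL> and <alert>) coexist without conflict.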

The source files for codesets with multibyte characters have more complex character maps. Example 7-2 shows a subset of character map entries from a source file for the Japanese SJIS codeset. This source file specifies entries from several character sets that must be supported within the same codeset.

Example 7-2:  Fragment from a charmap File for a Multibyte Codeset

# SJIS charmap
#
<code_set_name> "SJIS"   [1]
<mb_cur_min>    1    [2]
<mb_cur_max>    2    [3]
CHARMAP
#
# CS0: ASCII
#

.
.
.
<commercial-at>         \x40       [4]
<A>                     \x41       [4]
<B>                     \x42       [4]
.
.
.
#
# CS1: JIS X0208-1983 for ShiftJIS.
#
<zenkaku-space>         \x81\x40   [5]
<j0101>...<j0163>       \x81\x40   [5]
<j0164>...<j0194>       \x81\x80   [5]
.
.
.
#
# UDC Area in JIS X0208 plane
#
<u8501>...<u8563>       \xeb\x40   [6]
<u8564>...<u8594>       \xeb\x80   [6]
<u8601>...<u8663>       \xeb\x9f   [6]
.
.
.
#
# CS2: JIS X0201 (so-called Hankaku-Kana)
#
<kana-fullstop>         \xa1       [7]
.
.
.
<kana-conjunctive>      \xa5       [7]
<kana-WO>               \xa6       [7]
<kana-a>                \xa7       [7]
.
.
.
END CHARMAP

  1. Codeset name [Return to example]

  2. Minimum number of bytes per character

    This value must be 1. [Return to example]

  3. Maximum number of bytes per character

    In SJIS, the largest multibyte character is 2 bytes in length. [Return to example]

  4. Symbols and encodings for ASCII characters [Return to example]

  5. Symbols and encodings for SJIS characters

    Note how character symbols are specified as a range and how two hexadecimal values determine the encoding for a 2-byte character.

    When symbols are specified as a range of symbol values, the specified character encoding applies to the first symbol in the range. The localedef command automatically increments both the symbol value and the encoding value to create symbols and encodings for all characters in the range. [Return to example]

  6. Maps for user-defined characters within the SJIS codeset

    These maps establish ranges of encodings for which users can later define characters. [Return to example]

  7. Maps for the single-byte characters of the Hankaku-Kana character set [Return to example]
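
The range expansion described for the SJIS entries can be sketched as follows. This is a minimal illustration rather than the localedef algorithm: the function name expand_range is hypothetical, and the sketch assumes that the numeric suffix of the symbol is counted in decimal and that the encoding is incremented by one, as a big-endian integer, for each symbol in the range.

```python
import re

# "<j0101>...<j0163>": a non-numeric prefix followed by a decimal number.
RANGE = re.compile(r"<([^\d>]*)(\d+)>\.\.\.<([^\d>]*)(\d+)>")

def expand_range(symbol_range, first_encoding):
    # Return a dict of every symbol in the range and its encoding.
    m = RANGE.fullmatch(symbol_range)
    if m is None:
        raise ValueError("not a symbol range: " + symbol_range)
    prefix, start, end_prefix, end = m.groups()
    if prefix != end_prefix or len(start) != len(end):
        raise ValueError("range end points do not match: " + symbol_range)
    width = len(start)                       # keep leading zeros, as in j0101
    size = len(first_encoding)               # bytes per character
    base = int.from_bytes(first_encoding, "big")
    result = {}
    for offset, number in enumerate(range(int(start), int(end) + 1)):
        name = "<%s%0*d>" % (prefix, width, number)
        result[name] = (base + offset).to_bytes(size, "big")
    return result

row = expand_range("<j0101>...<j0163>", b"\x81\x40")
print(len(row), row["<j0163>"].hex())
```

Under these assumptions, the first range in Example 7-2 yields 63 symbols whose second byte runs from 0x40 through 0x7e, which is why the next range in the example starts at \x81\x80.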

Refer to charmap(4) for a complete list of rules that apply to character map source files.

Note

The symbolic names for characters in character map source files are in the process of becoming standardized. A future revision of the X/Open UNIX standard will likely specify both long and short symbolic names for characters.

The symbolic names for characters shown in this example are not necessarily the names being proposed for adoption by any standards group.

7.2    Creating Locale Definition Source Files

A locale definition source file defines data that is specific to a particular language and territory. The source file is organized into sections, one for each category of locale data being defined. Example 7-3 shows the structure of a locale definition source file in pseudocode. The sections for locale categories are discussed in more detail following the example.

Example 7-3:  Structure of Locale Source Definition File

# comment-line    [1]
 
comment_char      <char_symbol1>   [2]
escape_char       <char_symbol2>   [3]
 
CATEGORY_NAME    [4]
 
category_definition-statement   [5]
category_definition-statement   [5]

.
.
.
END CATEGORY_NAME [6]
.
.
.
[7]

  1. Comment line

    The number sign (#) is the default comment character. You can specify comments as entire lines by entering the comment character in the first column of the line. You cannot specify comments on the same lines as definition statements in locale source files. In this respect, locale source files differ from character map source files. [Return to example]

  2. Redefinition of comment character

    You can override the default comment character with an entry line that begins with the comment_char keyword, followed by the symbol for the desired character. The character symbol is defined in the character map (charmap) source file for the locale. [Return to example]

  3. Redefinition of escape character

    The escape character, by default the backslash (\), is used in decimal, hexadecimal, and octal constants and to indicate when definition statements are continued to the next line of the source file. You can override the default escape character with an entry line that begins with the escape_char keyword, followed by one or more blank characters, then the symbol for the desired character. The character symbol is defined in the character map source file for the locale. [Return to example]

  4. Header for locale category section

    Section headers correspond to category names, which are LC_CTYPE, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_MESSAGES, and LC_TIME. [Return to example]

  5. Definition statement for the category

    The format of these statements varies from one category to the next. In general, a statement begins with a keyword, followed by one or more spaces or tabs, then the definition itself.

    In place of any category definition statements, you can include a copy statement that brings in the definition statements for that category from another locale. For example:

    copy en_US.ISO8859-1
    

    If you include a copy statement, you can include no other statements in the category. [Return to example]

  6. Trailer for locale category section

    Section trailers start with the END keyword, followed by the category name. [Return to example]

  7. You can include sections for all locale categories or only a subset of categories. If you omit a section for a locale category from the source file, the definition for the omitted category is the same as defined for the POSIX, or C, locale. [Return to example]
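
The overall layout shown in Example 7-3 can be sketched with a small reader that splits a locale source file into its category sections. The function name split_categories is hypothetical, and the sketch makes simplifying assumptions: it uses the default comment and escape characters unless others are passed in, and it ignores comment_char and escape_char redefinition lines in the file itself.

```python
CATEGORIES = {"LC_CTYPE", "LC_COLLATE", "LC_NUMERIC",
              "LC_MONETARY", "LC_MESSAGES", "LC_TIME"}

def split_categories(text, comment_char="#", escape_char="\\"):
    # Return a dict that maps each category name to its list of statements.
    sections = {}
    current = None   # category being read, if any
    pending = ""     # statement text continued from earlier lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(comment_char):
            continue
        if line.endswith(escape_char):
            # The statement continues on the next line.
            pending += line[:-len(escape_char)]
            continue
        statement = pending + line
        pending = ""
        if current is None:
            if statement in CATEGORIES:          # section header
                current = statement
                sections[current] = []
        elif statement.split() == ["END", current]:   # section trailer
            current = None
        else:
            sections[current].append(statement)
    return sections

SOURCE = r"""
# sample fragment
LC_NUMERIC
decimal_point     "<comma>"
thousands_sep     ""
grouping          3;0
END LC_NUMERIC

LC_MESSAGES
yesstr  "<o><u><i><colon>\
<o><colon><O>"
END LC_MESSAGES
"""

sections = split_categories(SOURCE)
print(list(sections))
```

Categories that are absent from the result are the ones that, as noted above, take their definitions from the POSIX (C) locale.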

The following sections describe specific locale categories and include parts of the fr_FR.ISO8859-1@example.src locale source file. Section E.2 contains this source file in its entirety.

7.2.1    Defining the LC_CTYPE Locale Category

The LC_CTYPE section of a locale source file defines character classes and character attributes used in operations such as case conversion. Example 7-4 shows the definition for this section.

Example 7-4:  LC_CTYPE Category Definition

#############
LC_CTYPE     [1]
#############
 
upper   <A>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;\
        <N>;<O>;<P>;<Q>;<R>;<S>;<T>;<U>;<V>;<W>;<X>;<Y>;<Z>;\
        <A-grave>;\

.
.
.
        <U-diaeresis>     [2]

lower   <a>;<b>;<c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;\
        <n>;<o>;<p>;<q>;<r>;<s>;<t>;<u>;<v>;<w>;<x>;<y>;<z>;\
        <a-grave>;\
.
.
.
        <u-diaeresis>     [2]

space   <tab>;<newline>;<vertical-tab>;<form-feed>;\
        <carriage-return>;<space>     [2]

cntrl   <NUL>;<SOH>;<STX>;<ETX>;<EOT>;<ENQ>;<ACK>;\
        <alert>;<backspace>;<tab>;<newline>;<vertical-tab>;\
        <form-feed>;<carriage-return>;\
.
.
.
        <SOS>;<SGCI>;<SCI>;<CSI>;<ST>;<OSC>;<PM>;<APC>     [2]

graph   <exclamation-mark>;<quotation-mark>;<number-sign>;\
.
.
.
        <u-circumflex>;<u-diaeresis>;<y-acute>;<thorn-icelandic>;<y-diaeresis>     [2]

# print class includes everything in the graph class above, plus <space>.

print   <exclamation-mark>;<quotation-mark>;<number-sign>;\
.
.
.
        <u-circumflex>;<u-diaeresis>;<y-acute>;<thorn-icelandic>;<y-diaeresis>;\
        <space>     [2]

punct   <exclamation-mark>;<quotation-mark>;<number-sign>;\
        <dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
        <left-parenthesis>;<right-parenthesis>;<asterisk>;\
        <plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
        <colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
        <greater-than-sign>;<question-mark>;<commercial-at>;\
        <left-square-bracket>;<backslash>;<right-square-bracket>;\
        <circumflex>;<underscore>;<grave-accent>;<left-brace>;\
        <vertical-line>;<right-brace>;<tilde>     [2]

digit   <zero>;<one>;<two>;<three>;<four>;\
        <five>;<six>;<seven>;<eight>;<nine>     [2]

xdigit  <zero>;<one>;<two>;<three>;<four>;\
        <five>;<six>;<seven>;<eight>;<nine>;\
        <A>;<B>;<C>;<D>;<E>;<F>;\
        <a>;<b>;<c>;<d>;<e>;<f>     [2]

blank   <space>;<tab>     [2]

toupper (<a>,<A>);(<b>,<B>);(<c>,<C>);(<d>,<D>);(<e>,<E>);\
        (<f>,<F>);(<g>,<G>);(<h>,<H>);(<i>,<I>);(<j>,<J>);\
        (<k>,<K>);(<l>,<L>);(<m>,<M>);(<n>,<N>);(<o>,<O>);\
        (<p>,<P>);(<q>,<Q>);(<r>,<R>);(<s>,<S>);(<t>,<T>);\
        (<u>,<U>);(<v>,<V>);(<w>,<W>);(<x>,<X>);(<y>,<Y>);\
        (<z>,<Z>);\
        (<a-grave>,<A-grave>);\
        (<a-circumflex>,<A-circumflex>);\
        (<ae-ligature>,<AE-ligature>);\
        (<c-cedilla>,<C-cedilla>);\
        (<e-grave>,<E-grave>);\
        (<e-acute>,<E-acute>);\
        (<e-circumflex>,<E-circumflex>);\
        (<e-diaeresis>,<E-diaeresis>);\
        (<i-circumflex>,<I-circumflex>);\
        (<i-diaeresis>,<I-diaeresis>);\
        (<o-circumflex>,<O-circumflex>);\
        (<u-grave>,<U-grave>);\
        (<u-circumflex>,<U-circumflex>);\
        (<u-diaeresis>,<U-diaeresis>)     [3]

# tolower class is the inverse of toupper.

tolower (<A>,<a>);(<B>,<b>);(<C>,<c>);(<D>,<d>);(<E>,<e>);\
        (<F>,<f>);(<G>,<g>);(<H>,<h>);(<I>,<i>);(<J>,<j>);\
        (<K>,<k>);(<L>,<l>);(<M>,<m>);(<N>,<n>);(<O>,<o>);\
        (<P>,<p>);(<Q>,<q>);(<R>,<r>);(<S>,<s>);(<T>,<t>);\
        (<U>,<u>);(<V>,<v>);(<W>,<w>);(<X>,<x>);(<Y>,<y>);\
        (<Z>,<z>);\
        (<A-grave>,<a-grave>);\
        (<A-circumflex>,<a-circumflex>);\
        (<AE-ligature>,<ae-ligature>);\
        (<C-cedilla>,<c-cedilla>);\
        (<E-grave>,<e-grave>);\
        (<E-acute>,<e-acute>);\
        (<E-circumflex>,<e-circumflex>);\
        (<E-diaeresis>,<e-diaeresis>);\
        (<I-circumflex>,<i-circumflex>);\
        (<I-diaeresis>,<i-diaeresis>);\
        (<O-circumflex>,<o-circumflex>);\
        (<U-grave>,<u-grave>);\
        (<U-circumflex>,<u-circumflex>);\
        (<U-diaeresis>,<u-diaeresis>)     [3]

END LC_CTYPE     [4]

  1. Section header [Return to example]

  2. Definition of character class

    These definitions start with a keyword that stands for the character class (also referred to as a property), followed by one or more blank characters, then a list of symbols for all characters in that class. You can substitute the character's encoding for its symbol; however, specifying characters by their encodings diminishes the readability of the locale source file and makes it impossible to use the file with more than one codeset.

    Although not illustrated in the example, you can specify a horizontal ellipsis (...) to represent a range of characters. In the string <NUL>;...;<tab>, for example, the ellipsis represents all characters whose encodings fall between the encoding of the character whose symbol is <NUL> and the encoding of the character whose symbol is <tab>. The symbols and their encodings are specified in the charmap file for the locale.

    Character classes as defined by the X/Open UNIX standard are represented by the following keywords: upper, lower, alpha, digit, space, cntrl, punct, graph, print, xdigit, and blank.

    From the application standpoint, there is also the class alnum. This class is rarely defined in a locale because it is always a combination of characters in the alpha and digit classes.

    Unicode (*.UTF-8) locales include character classes as defined by the Unicode standard. See locale(4) for details about character classification for Unicode.

    Certain locales, such as those for Asian languages like Japanese, may define nonstandard character classes. [Return to example]

  3. Definitions of case conversion for letter characters

    These definitions, which begin with the keywords toupper and tolower, list symbols in pairs rather than individually. In the toupper definition shown here, the first symbol in the pair is the symbol for a lowercase letter and the second symbol is the symbol for that letter's uppercase equivalent. This definition determines what a letter is converted to when functions, like towupper() and towlower(), perform case conversion on text data.

    Locales that define nonstandard character classes may define other property conversion definitions that are used by the wctrans() and towctrans() functions.

    [Return to example]

  4. Section trailer [Return to example]
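
The pair notation used by the case conversion definitions can be sketched as follows. The function name parse_pairs is hypothetical, the statement is a shortened version of the toupper definition in Example 7-4, and the sketch assumes that symbolic names contain no escaped right angle brackets.

```python
import re

# One "(<from>,<to>)" pair of symbolic names.
PAIR = re.compile(r"\((<[^>]+>),(<[^>]+>)\)")

def parse_pairs(statement):
    # Split "keyword pairs" and return the keyword and a from->to dict.
    keyword, _, operands = statement.partition(" ")
    return keyword, dict(PAIR.findall(operands))

TOUPPER = "toupper (<a>,<A>);(<b>,<B>);(<e-acute>,<E-acute>)"

keyword, toupper = parse_pairs(TOUPPER)

# As the comment in the example notes, tolower is the inverse mapping.
tolower = {upper: lower for lower, upper in toupper.items()}

print(toupper["<e-acute>"], tolower["<A>"])
```

Deriving tolower by inverting toupper works only when the mapping is one-to-one, which holds for the pairs in this locale.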

The preceding example does not completely illustrate all the options you can use when defining the LC_CTYPE category. You can:

Applications can use the wctype( ) and iswctype( ) functions to determine and test all character classes (including user-defined ones). Applications can use class-specific functions, such as iswalpha( ) and iswpunct( ), to test the standard character classes.

Note

The LC_CTYPE category of the fr_FR.ISO8859-1@example locale is limited to letter characters in the French language. Some locale developers would define character classes to include characters in all the languages supported by the ISO 8859-1 character set. This practice allows locales for multiple Western European languages to use the same LC_CTYPE source definitions through a copy statement.

Refer to locale(4) for additional rules and restrictions that apply to the LC_CTYPE category definition.
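
The horizontal ellipsis notation for character classes, described earlier in this section, can be sketched as a lookup against the charmap. This is an illustration only: the function name expand_class is hypothetical, the charmap is reduced to the first few entries of Example 7-1 with encodings held as integers, and the sketch assumes that an ellipsis always has a symbol on each side and that both end points are included in the range.

```python
def expand_class(operands, charmap):
    # Return the sorted encodings selected by a class operand list.
    items = operands.split(";")
    codes = set()
    for i, item in enumerate(items):
        if item == "...":
            low = charmap[items[i - 1]]
            high = charmap[items[i + 1]]
            # Every character defined in the charmap within the range.
            codes.update(c for c in charmap.values() if low <= c <= high)
        else:
            codes.add(charmap[item])
    return sorted(codes)

CHARMAP = {
    "<NUL>": 0x00, "<SOH>": 0x01, "<STX>": 0x02, "<ETX>": 0x03,
    "<EOT>": 0x04, "<ENQ>": 0x05, "<ACK>": 0x06, "<BEL>": 0x07,
    "<alert>": 0x07, "<backspace>": 0x08, "<tab>": 0x09,
    "<newline>": 0x0a,
}

print(expand_class("<NUL>;...;<tab>", CHARMAP))
```

The result is expressed as encodings rather than symbols because one encoding can have several symbolic names, as <BEL> and <alert> do here.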

7.2.2    Defining the LC_COLLATE Locale Category

The LC_COLLATE section of a locale source file specifies how characters and strings are collated. Example 7-5 shows part of an LC_COLLATE section.

Example 7-5:  LC_COLLATE Category Definition

LC_COLLATE       [1]
order_start             forward;backward;forward   [2]
<NUL>     [3]
<SOH>
<STX>
<ETX>
<EOT>
<ENQ>
<ACK>
<alert>
<backspace>
<tab>

.
.
.
<APC>     [3]
<space>                 <space>;<space>;<space>
<exclamation-mark>      <exclamation-mark>;<exclamation-mark>;<exclamation-mark>
<quotation-mark>        <quotation-mark>;<quotation-mark>;<quotation-mark>
.
.
.
<a>                     <a>;<a>;<a>     [3]
<A>                     <a>;<a>;<A>
<feminine>              <a>;<feminine>;<feminine>
<a-acute>               <a>;<a-acute>;<a-acute>
<A-acute>               <a>;<a-acute>;<A-acute>
<a-grave>               <a>;<a-grave>;<a-grave>
<A-grave>               <a>;<a-grave>;<A-grave>
<a-circumflex>          <a>;<a-circumflex>;<a-circumflex>
<A-circumflex>          <a>;<a-circumflex>;<A-circumflex>
<a-ring>                <a>;<a-ring>;<a-ring>
<A-ring>                <a>;<a-ring>;<A-ring>
<a-diaeresis>           <a>;<a-diaeresis>;<a-diaeresis>
<A-diaeresis>           <a>;<a-diaeresis>;<A-diaeresis>
<a-tilde>               <a>;<a-tilde>;<a-tilde>
<A-tilde>               <a>;<a-tilde>;<A-tilde>
<ae-ligature>           <a>;<a><e>;<a><e>
<AE-ligature>           <a>;<a><e>;<A><E>
<b>                     <b>;<b>;<b>
<B>                     <b>;<b>;<B>
<c>                     <c>;<c>;<c>
<C>                     <c>;<c>;<C>
<c-cedilla>             <c>;<c-cedilla>;<c-cedilla>
<C-cedilla>             <c>;<c-cedilla>;<C-cedilla>
.
.
.
<z>                     <z>;<z>;<z>     [3]
<Z>                     <z>;<z>;<Z>
UNDEFINED     [4]
order_end     [5]

END LC_COLLATE     [6]

  1. Section header [Return to example]

  2. An order_start keyword that marks the beginning of a section with statements that assign collating weights to elements

    Following the order_start keyword on the same line are sort directives, separated by semicolons (;), one for each sorting pass. Sort directives can include the keywords forward, backward, and position, which are described in the following paragraphs.

    The number of sort directives corresponds to the number of weights each collating element is assigned in subsequent statements.

    Each sort directive and its associated set of weights specify information for one pass, or level, of string comparison. The first directive applies when the string comparison operation applies the primary weight, the second when the string comparison operation applies the secondary weight, and so on. The number of levels required to collate strings correctly depends on language and cultural requirements and therefore varies from one locale to another. There is also a level number maximum, associated with the COLL_WEIGHTS_MAX setting in the limits.h and sys/localedef.h files. On Tru64 UNIX systems, you are limited to six collation levels (sort directives).

    The backward directive is used for many languages to ensure that accented characters sort after unaccented characters only if the compared strings are otherwise equivalent.

    The position directive is frequently used to handle characters, such as the hyphen (-) in Western European languages, whose significance can be relative to word position. For example, assume you want the word "o-ring" to collate in a word list before the word "or-ing", but do not want the hyphen to be considered until after strings are sorted by letters alone. You need two sort directives and associated sets of weight specifiers to implement this order. For the first comparison operation, you specify forward as the sort directive, letters as the first weights for all letter characters, and IGNORE as the weight for the hyphen character. For the second, or a later, comparison operation, you specify forward position as the sort directive, IGNORE as the weight for all letter characters, and the hyphen as the weight for the hyphen character.

    If you do not specify a sort directive, the default is forward. [Return to example]

  3. Collation order statements for elements

    These statements specify a character symbol, optionally followed by one or more blank characters (spaces or tabs), then the symbols for characters that have the same weight at each stage of the sort.

    In the example, the sort order is control characters, followed by punctuation and digits, and then letters. Letters are sorted on multiple passes, with diacritics and case ignored on the first pass, diacritics being significant on the second pass, and case being significant on the third pass. [Return to example]

  4. Collation order statement for characters not specified in other collation order statements

    The UNDEFINED keyword begins a collation order statement to be applied to all characters that are defined in the locale's charmap file but not specified in other collation order statements. Characters that fall into the UNDEFINED category are considered in regular expressions to belong to the same equivalence class.

    You should always include the UNDEFINED collation order statement. If this statement is absent, the localedef command includes undefined characters at the end of the collating order and issues a warning. Furthermore, if you place an UNDEFINED statement as the last collation order statement, the localedef command can sometimes compress all undefined characters into one entry. This action can reduce the size of the locale.

    This locale specifies that any characters specified in the locale's charmap file but not handled by other collation order statements be ordered last.

    An UNDEFINED statement can have an operand. For example, the IGNORE keyword causes any characters unspecified by other collation order statements to be ignored for the sort pass in which IGNORE appears. If the following UNDEFINED statement had been included in the example, characters not specified in other collation order statements would be ignored in all sort passes defined by those statements:

    UNDEFINED    IGNORE;IGNORE;IGNORE
    

    [Return to example]

  5. Trailer to indicate the end of collation order statements [Return to example]

  6. Trailer to indicate the end of the LC_COLLATE section [Return to example]
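
The effect of the forward;backward;forward directives can be sketched with a multilevel sort key. This is a simplified model, not the system's strcoll() implementation: the ORDER table is a small hypothetical excerpt in the style of Example 7-5, a symbol's weight is taken to be its position in that table, and a backward directive is modeled by comparing the weights of that level from the end of the string.

```python
# Sort directives from the order_start line of Example 7-5.
DIRECTIVES = ["forward", "backward", "forward"]

# Hypothetical excerpt of collation order statements:
# (character, (primary, secondary, tertiary weight symbols)).
ORDER = [
    ("c", ("c", "c", "c")), ("C", ("c", "c", "C")),
    ("e", ("e", "e", "e")), ("E", ("e", "e", "E")),
    ("é", ("e", "é", "é")), ("É", ("e", "é", "É")),
    ("o", ("o", "o", "o")), ("O", ("o", "o", "O")),
    ("ô", ("o", "ô", "ô")), ("Ô", ("o", "ô", "Ô")),
    ("t", ("t", "t", "t")), ("T", ("t", "t", "T")),
]

# A symbol's weight is its position in the collation order.
RANK = {char: position for position, (char, _) in enumerate(ORDER)}
WEIGHTS = {char: tuple(RANK[w] for w in symbols) for char, symbols in ORDER}

def sort_key(word):
    # Build one tuple of weights per level; reverse it for "backward".
    levels = []
    for level, directive in enumerate(DIRECTIVES):
        weights = [WEIGHTS[char][level] for char in word]
        if directive == "backward":
            weights.reverse()
        levels.append(tuple(weights))
    return tuple(levels)

words = ["côté", "Cote", "coté", "côte", "cote"]
print(sorted(words, key=sort_key))
```

All five words have the same primary weights, so the secondary level decides the order of the accented forms. Because that level is compared backward, an accent near the end of a word outweighs one near the beginning, and the tertiary level separates "cote" from "Cote" only after the other levels compare equal.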

The preceding example shows only a few of the options that you can specify when defining the LC_COLLATE category. You can also use:

Refer to locale(4) for more detailed information on the LC_COLLATE category definition.

7.2.3    Defining the LC_MESSAGES Locale Category

The LC_MESSAGES section of a locale source file defines strings that are valid for affirmative and negative responses from users. Example 7-6 shows an LC_MESSAGES section.

Example 7-6:  LC_MESSAGES Category Definition

LC_MESSAGES    [1]
 
# yes expression. The following designates:
# "^([oO]|[oO][uU][iI])"
 
yesexpr       "<circumflex><left-parenthesis>\
<left-square-bracket><o><O><right-square-bracket>\
<vertical-line><left-square-bracket><o><O>\
<right-square-bracket><left-square-bracket><u><U>\
<right-square-bracket><left-square-bracket><i><I>\
<right-square-bracket><right-parenthesis>"    [2]
 
# no expression. The following designates:
# "^([nN]|[nN][oO][nN])"
 
noexpr        "<circumflex><left-parenthesis>\
<left-square-bracket><n><N><right-square-bracket>\
<vertical-line><left-square-bracket><n><N>\
<right-square-bracket><left-square-bracket><o><O>\
<right-square-bracket><left-square-bracket><n><N>\
<right-square-bracket><right-parenthesis>"    [3]
 
# yes string. The following designates: "oui:o:O"
 
yesstr        "<o><u><i><colon><o><colon><O>"    [4]
 
# no string. The following designates: "non:n:N"
 
nostr         "<n><o><n><colon><n><colon><N>"    [5]
 
END LC_MESSAGES    [6]
 
 

  1. Section header [Return to example]

  2. Definition of an expression for a valid "yes" response

    This entry consists of the yesexpr keyword, followed by one or more spaces or tabs, and an extended regular expression that is delimited by double quotation marks.

    This expression specifies that "oui" or "o" (case is ignored) is a valid affirmative response in this locale. Note that the regular expression for yesexpr specifies individual characters by their symbols as defined in the locale's charmap file. [Return to example]

  3. Definition of an expression for a valid "no" response

    This entry consists of the noexpr keyword, followed by one or more spaces or tabs, and an extended regular expression that is delimited by double quotation marks.

    This expression specifies that "non" or "n" (case is ignored) is a valid negative response in this locale. [Return to example]

  4. Definition of a string for a valid "yes" response

    This entry consists of the yesstr keyword, followed by one or more spaces or tabs, and a fixed string that is delimited by double quotation marks.

    The yesstr entry is marked as LEGACY in the X/Open UNIX standard and is not included in the POSIX standard; however, some applications and systems software still might use yesstr rather than yesexpr. To ensure that your locale works correctly with such software, you should define yesstr in your locale. Note that the X/Open UNIX standard defines a single fixed string for yesstr. The colon (:) separator, which allows multiple fixed strings to be specified, is an extension to the standard definition. [Return to example]

  5. Definition of a string for a valid "no" response

    This entry consists of the nostr keyword, followed by one or more spaces or tabs, and a fixed string that is delimited by double quotation marks.

    The nostr entry is marked as LEGACY in the X/Open UNIX standard and is not included in the POSIX standard; however, some applications and systems software still might use nostr rather than noexpr. To ensure that your locale works correctly with such software, you should define nostr in your locale. Note that the X/Open UNIX standard defines a single fixed string for nostr. The colon (:) separator, which allows multiple fixed strings to be specified, is an extension to the standard definition. [Return to example]

  6. Section trailer [Return to example]
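
The way an application might apply the yesexpr and noexpr expressions can be sketched as follows. The function name classify_response is hypothetical, and the sketch uses Python's re module, which is not a POSIX extended regular expression engine but treats these two particular expressions the same way.

```python
import re

# The expressions that the yesexpr and noexpr entries designate.
YESEXPR = "^([oO]|[oO][uU][iI])"
NOEXPR = "^([nN]|[nN][oO][nN])"

def classify_response(response):
    # Return "yes", "no", or "unknown" for a user's reply.
    if re.search(YESEXPR, response):
        return "yes"
    if re.search(NOEXPR, response):
        return "no"
    return "unknown"

for reply in ("Oui", "o", "NON", "peut-être"):
    print(reply, classify_response(reply))
```

Both expressions are anchored only at the start of the reply, so any response that begins with "o" or "n" in either case is accepted.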

As an alternative to specifying symbol definitions, you can use the copy statement between the section header and trailer to duplicate an existing locale's definition of the LC_MESSAGES category. The copy statement represents a complete definition of the category and cannot be used along with explicit symbol definitions.

7.2.4    Defining the LC_MONETARY Locale Category

The LC_MONETARY section of the locale source file defines the rules and symbols used to format monetary values. Application developers use the localeconv( ) and nl_langinfo( ) functions to determine the information defined in this section and apply formatting rules through the strfmon( ) function. Example 7-7 shows an LC_MONETARY section.

Example 7-7:  LC_MONETARY Category Definition

LC_MONETARY   [1]
 
int_curr_symbol   "<F><R><F><space>"   [2]
currency_symbol   "<F>"    [2]
mon_decimal_point "<comma>"    [2]
mon_thousands_sep ""    [2]
mon_grouping      3;0    [2]
positive_sign     ""    [2]
negative_sign     "<hyphen>"    [2]

.
.
.
END LC_MONETARY [3]

  1. Section header [Return to example]

  2. Symbol definitions

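    The entries in the example specify the following: the international currency symbol is FRF followed by a space; the local currency symbol is F; the radix character for monetary values is the comma; no character is defined to separate groups of digits; digits to the left of the radix character are grouped by threes (3;0); no string is defined to mark positive values; and the hyphen marks negative values.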

    [Return to example]

  3. Section trailer [Return to example]
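
The operands in Example 7-7 are strings of character symbols, and the characters they stand for come from the locale's charmap file. The following minimal sketch shows that substitution. The function name decode_symbols is hypothetical, the CHARMAP dictionary holds only the handful of ISO 8859-1 entries needed here, and the sketch ignores any text that is not written as a symbol.

```python
import re

SYMBOL = re.compile(r"<[^>]+>")

# Hypothetical excerpt of an ISO 8859-1 charmap: symbol -> encoding.
CHARMAP = {
    "<F>": b"\x46", "<R>": b"\x52", "<space>": b"\x20",
    "<comma>": b"\x2c", "<hyphen>": b"\x2d",
}

def decode_symbols(operand, charmap, codeset="latin-1"):
    # Replace each <symbol> in a quoted operand with its character.
    body = operand.strip('"')
    raw = b"".join(charmap[name] for name in SYMBOL.findall(body))
    return raw.decode(codeset)

print(repr(decode_symbols('"<F><R><F><space>"', CHARMAP)))
print(repr(decode_symbols('"<comma>"', CHARMAP)))
```

An empty operand, such as the one given for mon_thousands_sep, decodes to an empty string.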

The following list describes the symbol names you can define in the LC_MONETARY section.

As an alternative to specifying symbol definitions, you can use the copy statement between the section header and trailer to duplicate an existing locale's definition of LC_MONETARY. The copy statement represents a complete definition of the category and cannot be used along with explicit symbol definitions.

Refer to locale(4) for complete information about specifying LC_MONETARY symbol definitions.

7.2.5    Defining the LC_NUMERIC Locale Category

The LC_NUMERIC section of the locale source file defines the rules and symbols used to format numeric data. You can use the localeconv( ) and nl_langinfo( ) functions to access this formatting information. Example 7-8 shows an LC_NUMERIC section.

Example 7-8:  LC_NUMERIC Category Definition

LC_NUMERIC    [1]
decimal_point     "<comma>"    [2]
thousands_sep     ""    [3]
grouping          3;0    [4]
 
END LC_NUMERIC    [5]

  1. Category header [Return to example]

  2. Definition of radix character (decimal point) [Return to example]

  3. Definition of character used to separate groups of digits to the left of the radix character. In this locale, no default character is defined. Therefore, applications must supply this character, if needed. [Return to example]

  4. The size of each group of digits to the left of the radix character. The character defined by thousands_sep, if any, is inserted between the groups defined by grouping.

    You can vary the size of groups by specifying multiple digits separated by a semicolon (;). For example, 3;2 specifies that the first group to the left of the radix character contains three digits and all subsequent groups contain two digits. On Tru64 UNIX systems, 3;0 and 3 are equivalent; that is, all digits to the left of the radix character are grouped by threes. [Return to example]

  5. Category trailer [Return to example]
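
The grouping rule can be sketched as follows. The function name group_digits is hypothetical, and the sketch follows the behavior described above: the last group size repeats for all remaining digits, and a trailing 0 is treated the same as omitting it. Because the sample locale defines no thousands_sep character, the calls below supply separators purely for illustration.

```python
def group_digits(digits, grouping, sep):
    # Insert sep between groups of digits, counting from the right.
    sizes = [size for size in grouping if size > 0]   # 3;0 behaves like 3
    if not sizes:
        return digits
    groups = []
    rest = digits
    index = 0
    while rest:
        # The last size in the list repeats for all remaining digits.
        size = sizes[min(index, len(sizes) - 1)]
        groups.append(rest[-size:])
        rest = rest[:-size]
        index += 1
    return sep.join(reversed(groups))

print(group_digits("1234567", [3, 0], "."))   # groups of three
print(group_digits("1234567", [3, 2], " "))   # three, then twos
```

With an empty separator, as in Example 7-8, the function returns the digits unchanged, which matches the statement that applications must supply the separator themselves if they need one.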

The preceding example shows all of the symbols you can define in the LC_NUMERIC section. In place of any symbol definitions, you can specify a copy statement between the section header and trailer to include this section from another locale.

Refer to locale(4) for detailed rules about symbol definitions.

7.2.6    Defining the LC_TIME Locale Category

The LC_TIME section of a locale source file defines the interpretation of field descriptors supported by the date command. This section also affects the behavior of the strftime( ), wcsftime( ), strptime( ), and nl_langinfo( ) functions. Example 7-9 shows some of the symbols defined for the sample French locale.

Example 7-9:  LC_TIME Category Definition

LC_TIME    [1]
 
abday   "<d><i><m>";\
        "<l><u><n>";\
        "<m><a><r>";\
        "<m><e><r>";\
        "<j><e><u>";\
        "<v><e><n>";\
        "<s><a><m>"    [2]
 
day     "<d><i><m><a><n><c><h><e>";\
        "<l><u><n><d><i>";\
        "<m><a><r><d><i>";\
        "<m><e><r><c><r><e><d><i>";\
        "<j><e><u><d><i>";\
        "<v><e><n><d><r><e><d><i>";\
        "<s><a><m><e><d><i>"    [3]
 
abmon   "<j><a><n>";\
        "<f><e-acute><v>";\
        "<m><a><r>";\
        "<a><v><r>";\
        "<m><a><i>";\
        "<j><u><n>";\
        "<j><u><l>";\
        "<a><o><u-circumflex>";\
        "<s><e><p>";\
        "<o><c><t>";\
        "<n><o><v>";\
        "<d><e-acute><c>"    [4]
 
mon     "<j><a><n><v><i><e><r>";\
        "<f><e-acute><v><r><i><e><r>";\
        "<m><a><r><s>";\
        "<a><v><r><i><l>";\
        "<m><a><i>";\
        "<j><u><i><n>";\
        "<j><u><i><l><l><e><t>";\
        "<a><o><u-circumflex><t>";\
        "<s><e><p><t><e><m><b><r><e>";\
        "<o><c><t><o><b><r><e>";\
        "<n><o><v><e><m><b><r><e>";\
        "<d><e-acute><c><e><m><b><r><e>"    [5]
 
# date/time format. The following designates this
# format: "%a %e %b %H:%M:%S %Z %Y"
 
d_t_fmt "<percent-sign><a><space><percent-sign><e>\
<space><percent-sign><b><space><percent-sign><H>\
<colon><percent-sign><M><colon><percent-sign><S>\
<space><percent-sign><Z><space><percent-sign><Y>"    [6]

.
.
.
END LC_TIME [7]

  1. Section header [Return to example]

  2. Abbreviated names for days of the week

    Use the %a conversion specifier to include these strings in formats. [Return to example]

  3. Full names for days of the week

    Use the %A conversion specifier to include these strings in formats. [Return to example]

  4. Abbreviated names for months of the year

    Use the %b conversion specifier to include these strings in formats. [Return to example]

  5. Full names for months of the year

    Use the %B conversion specifier to include these strings in formats. [Return to example]

  6. Format for combined date and time information

    The format combines field descriptors as defined for the strftime( ) function. See strftime(3) for a complete list of field descriptors.

    The specified format includes the field descriptors for the abbreviated day of the week (%a), the day of the month (%e), the abbreviated month name (%b), the hour on a 24-hour clock (%H), the minutes (%M), the seconds (%S), the time zone (%Z), and the full representation of the year (%Y). If the date were April 23, 1999 on the East coast of the United States, the format specified in this example would cause the date command to display ven 23 avr 13:43:05 EDT 1999. [Return to example]

  7. Section trailer [Return to example]

The preceding example includes only some of the symbol definitions that are standard for the LC_TIME category. The following definitions are also standard:

As is true for other category sections, you can specify a copy statement to include all LC_TIME definitions from another locale. Note that Tru64 UNIX supports symbols and field descriptors in addition to those described here. Refer to locale(4) for more complete information.
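The effect of the d_t_fmt format can be tried from a C program. The following is a minimal sketch rather than part of the sample locale: the function name format_sample_time is hypothetical, the date is the one used in the callout for Example 7-9, and the function leaves the choice of locale to the caller.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: formats 13:43:05 on Friday, April 23, 1999
 * with the strftime( ) format passed in fmt.
 * Returns the number of bytes stored in buf, or 0 if buf is too small.
 */
size_t format_sample_time(char *buf, size_t size, const char *fmt)
{
    struct tm tm;

    memset(&tm, 0, sizeof tm);
    tm.tm_year = 1999 - 1900;   /* years since 1900 */
    tm.tm_mon  = 3;             /* April (months count from 0) */
    tm.tm_mday = 23;
    tm.tm_wday = 5;             /* Friday; %a reads this field */
    tm.tm_hour = 13;
    tm.tm_min  = 43;
    tm.tm_sec  = 5;

    return strftime(buf, size, fmt, &tm);
}
```

An application that calls setlocale(LC_TIME, "") before this function gets the day and month names from the abday and abmon definitions of the locale named by the environment. Without that call, the program runs in the C locale and the names are in English. The %Z descriptor is best left out of such an experiment because the time zone abbreviation depends on the TZ setting.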

7.3    Building Libraries to Convert Multibyte/Wide-Character Encodings

C library routines rely on a set of special interfaces to convert characters to and from data file encoding and wide-character encoding (internal process code). By default, the C library routines use interfaces that handle only single-byte characters. However, many routines are defined with entry points that permit use of alternative interfaces for handling multibyte characters. The interfaces that can be tailored to a locale's codeset are called methods.

Only locales with multibyte codesets must use methods. When a locale uses methods, there are some methods that the locale must supply and other methods that it can optionally supply. A method is required when the corresponding interface is converting characters between data formats and needs codeset-specific logic to do that operation correctly. A method is optional when the corresponding interface is working with data after it has been converted to wide-character format and can apply logic that is valid for both single-byte and multibyte characters.

Methods must be available on the system in a shareable library. This library and the functions that implement each method in the library are made known to the localedef command through a methods file. When the localedef command processes the methods file along with the charmap and locale source files, the resulting locale includes pointers to all methods that are supplied with the locale, along with pointers to default implementations for optional methods that are not supplied with the locale. When you set the LANG variable to the newly built locale and run a command or application, methods are used wherever they have been enabled in the system software.

7.3.1    Required Methods

If your locale uses methods, it must supply the following methods, without which it is impossible for C Library functions to convert data between multibyte and wide-character formats:

7.3.1.1    Writing the _ _mbstopcs Method for the fgetws Function

The fgetws( ) function uses the _ _mbstopcs method to convert the bytes in the standard I/O (stdio) buffer to a wide-character string. The function that implements this method must return the number of wide characters converted by the call.

This method is similar to the one for mbstowcs (see Section 7.3.1.6) but contains additional parameters to meet the needs of fgetws( ). By convention, a C source file for this method has the file name _ _mbstopcs_codeset.c, where codeset identifies the codeset for which the method is tailored. Example 7-10 shows the file _ _mbstopcs_sdeckanji.c that defines the _ _mbstopcs method used with the ja_JP.sdeckanji locale.

Example 7-10:  The _ _mbstopcs_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>  [1]
#include <wchar.h>   [1]
#include <sys/localedef.h>   [1]
 
int _ _mbstopcs_sdeckanji(
        wchar_t *pwcs,   [2]
        size_t pwcs_len,   [3]
        const char *s,   [4]
        size_t s_len,   [5]
        int stopchr,   [6]
        char **endptr,   [7]
        int *err,   [8]
        _LC_charmap_t *handle )   [9]
{
    int cnt = 0;   [10]
    int pwcs_cnt = 0;   [10]
    int s_cnt = 0;   [10]
 
    *err = 0;   [11]
 
    while (1) {   [12]
        if (pwcs_cnt >= pwcs_len || s_cnt >= s_len) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        }   [13]
        if ((cnt = _ _mbtopc_sdeckanji(&(pwcs[pwcs_cnt]),
            (char *)&(s[s_cnt]), (s_len - s_cnt), err, handle)) == 0) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        }   [14]
        pwcs_cnt++;   [15]
        if ( s[s_cnt] == (char) stopchr) {
            *endptr = (char *)&(s[s_cnt+1]);
            break;
        }   [16]
        s_cnt += cnt;   [17]
    }   [18]
    return (pwcs_cnt);   [19]
}

  1. Includes header files that contain constants and structures required for this method. [Return to example]

  2. Points, through pwcs, to a buffer that stores the wide-character string. [Return to example]

  3. Defines a variable, pwcs_len, to store the size of the pwcs buffer. [Return to example]

  4. Points, through s, to a buffer that stores the multibyte-character string being converted. [Return to example]

  5. Defines a variable, s_len, to store the number of bytes of data in the s buffer.

    This parameter is needed because the fgetws( ) function reads from the standard I/O buffer, which does not contain null-terminated strings. [Return to example]

  6. Defines a variable, stopchr, to contain a byte value that would force conversion to stop.

    This value, typically \n, is passed to the method on the call from the fgetws( ) function, which handles only one line of input per call. [Return to example]

  7. Defines a variable, endptr, that points to the byte following the last byte converted.

    This pointer is needed to specify the starting character in the standard I/O buffer for the next call to fgetws( ). [Return to example]

  8. Points, through err, to a variable that stores execution status for the call made by this method to the mbtopc method. [Return to example]

  9. Points, through handle, to a structure that points to the methods that parse character maps for this locale.

    The localedef command creates and stores values in the _LC_charmap_t structure. [Return to example]

  10. Initializes variables that indicate the number of bytes that a character uses in multibyte format (supplied by the mbtopc method) and the byte or character position in buffers that the fgetws( ) function uses. [Return to example]

  11. Sets err to zero (0) to indicate success. [Return to example]

  12. Starts the while loop that converts the multibyte string. [Return to example]

  13. Sets endptr and breaks out of the loop when there is either no more space in the buffer that stores wide-character data or no more data in the buffer that stores multibyte data. [Return to example]

  14. Calls the mbtopc method to convert a character from multibyte format to wide-character format. If the mbtopc method cannot convert the character and returns zero (0), sets endptr to the first byte of that character and breaks out of the loop.

    The err variable contains the return status of the call to the mbtopc method:

    [Return to example]

  15. Increments the character position in the buffer that stores the wide-character data. [Return to example]

  16. Sets endptr to the byte following the stopchr character and breaks out of the loop if the stopchr character is encountered in the multibyte data. [Return to example]

  17. Increments the byte position in the buffer that contains multibyte data. [Return to example]

  18. Ends the while loop. [Return to example]

  19. Returns the number of characters in the buffer that contains wide-character data. [Return to example]
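The control flow of Example 7-10 can also be sketched in portable C. This is an illustration under stated assumptions, not the Tru64 UNIX implementation: it uses the standard mbrtowc( ) function in place of the locale's _ _mbtopc method, it omits the err and handle parameters, and the name bytes_to_wide is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not a Tru64 UNIX interface): converts up to
 * src_len bytes from src into at most dst_len wide characters in dst,
 * and stops after converting the stopchr byte if one is found.
 * Sets *endptr to the first unconverted byte and returns the number
 * of wide characters stored.
 */
size_t bytes_to_wide(wchar_t *dst, size_t dst_len,
                     const char *src, size_t src_len,
                     int stopchr, const char **endptr)
{
    mbstate_t state;
    size_t dst_cnt = 0;
    size_t src_cnt = 0;

    memset(&state, 0, sizeof state);        /* initial conversion state */

    while (dst_cnt < dst_len && src_cnt < src_len) {
        size_t n = mbrtowc(&dst[dst_cnt], &src[src_cnt],
                           src_len - src_cnt, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            break;                          /* invalid or incomplete */
        if (n == 0)
            n = 1;                          /* a null byte uses 1 byte */

        dst_cnt++;
        src_cnt += n;
        if (n == 1 && src[src_cnt - 1] == (char)stopchr)
            break;                          /* stop after stopchr */
    }
    *endptr = &src[src_cnt];
    return dst_cnt;
}
```

As in the example, the caller can pass the returned end pointer back as the start of the next conversion, which is how a line-oriented reader picks up where the previous call stopped.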

7.3.1.2    Writing the _ _mbtopc Method for the getwc( ) Function

The getwc( ) or fgetwc( ) function calls the _ _mbtopc method to convert a multibyte character to a wide character. The method returns the number of bytes in the multibyte character that is converted. This method is similar to the one for mbtowc (see Section 7.3.1.7) but contains an additional parameter that getwc( ) needs. By convention, a C source file for this method has the file name _ _mbtopc_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-11 shows the _ _mbtopc_sdeckanji.c file, which defines the _ _mbtopc method used with the ja_JP.sdeckanji locale.

Example 7-11:  The _ _mbtopc_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>  [1]
#include <wchar.h>   
#include <sys/localedef.h>   
 
/*
The algorithm for this conversion is:
s[0] < 0x9f:  PC = s[0]
s[0] = 0x8e:  PC = s[1] + 0x5f;
s[0] = 0x8f   PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
s[0] > 0xa1:0xa1 < s[1] < 0xfe
              PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
            0x21 < s[1] < 0x7e
              PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a
+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   [2]
int  _ _mbtopc_sdeckanji(
        wchar_t *pwc,   [3]
        char *ts,   [4]
        size_t maxlen,   [5]
        int *err,   [6]
        _LC_charmap_t *handle )   [7]
{
    wchar_t dummy;   [8]
    unsigned char *s = (unsigned char *)ts;   [9]
    if (s == NULL)
        return(0);   [10]
    if (pwc == (wchar_t *)NULL)
        pwc = &dummy;   [11]
    *err = 0;   [12]
    if (s[0] <= 0x8d) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        }
        else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    }   [13]
    else if (s[0] == 0x8e) {
        if (maxlen >= 2) {
            if (s[1] >=0xa1 && s[1] <=0xfe) {
                *pwc = (wchar_t) (s[1] + 0x5f);
                return(2);
            }
        }
        else {
            *err = 2;
            return(0);
        }
    }   [14]
    else if (s[0] == 0x8f) {
        if (maxlen >= 3) {
            if ((s[1] >=0xa1 && s[1] <=0xfe) &&
                (s[2] >=0xa1 && s[2] <= 0xfe)) {
                *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                       (wchar_t) (s[2] - 0xa1)) + 0x303c;
                return(3);
            }
        }
        else {
            *err = 3;
            return(0);
        }
    }   [15]
 
    else if (s[0] <= 0x9f) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        }
        else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
 
    }   [16]
    else if (s[0] >= 0xa1 && s[0] <= 0xfe) {
        if (maxlen >= 2) {
            if  (s[1] >=0xa1 && s[1] <= 0xfe) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                       (wchar_t) (s[1] - 0xa1)) + 0x15e;
                return(2);
            } else if  (s[1] >=0x21 && s[1] <= 0x7e) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                       (wchar_t) (s[1] - 0x21)) + 0x5f1a;
                return(2);
            }
        }
        else {
            *err = 2;
            return(0);
        }
 
    }   [17]
    *err = -1;
    return(0);   [18]
}

  1. Include header files that contain constants and structures required for this method [Return to example]

  2. Describes the algorithm used to determine the number of bytes and valid byte combinations for the different character sets that the codeset supports

    The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid. [Return to example]

  3. Points, through pwc, to a buffer that stores the wide character [Return to example]

  4. Points, through ts, to a buffer that stores the bytes that are passed to the method from the calling function [Return to example]

  5. Declares a variable, maxlen, that stores the maximum number of bytes in the multibyte data

    This value is passed by the calling function. [Return to example]

  6. Points, through err, to a buffer that stores execution status [Return to example]

  7. Points, through handle, to a structure that contains pointers to the methods that parse the character maps for this locale [Return to example]

  8. Declares a variable, dummy, to which pwc can be set to ensure a valid address [Return to example]

  9. Casts ts (an array of signed characters) to s (an array of unsigned characters)

    This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension to values when comparing a small signed data type, such as char, to a large signed data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in bits that remain when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e. In this case, a condition like the following is evaluated as true when you expect it to be false:

    if (s[0] <= 0x8d) [Return to example]

  10. Returns zero (0) if s is a null pointer [Return to example]

  11. Points pwc to dummy if the calling function passed a null pointer for pwc

    This operation ensures that pwc always points to a valid address; otherwise, the method could produce a segmentation fault when it stores the converted wide character through pwc. [Return to example]

  12. Initializes err to zero (0) to indicate success [Return to example]

  13. Determines if the character is one of the single-byte characters that the codeset defines for values equal to or less than 0x8d

    If maxlen is less than 1, returns zero (0) to indicate that no bytes were converted and sets err to 1 to indicate that 1 byte is needed to form a valid character.

    If the byte value is in the range being tested, moves the associated process code value to pwc and returns 1 to indicate the number of bytes converted. [Return to example]

  14. Determines if the character is one of the double-byte characters that the codeset defines for the value 0x8e (first byte) and the value range 0xa1 to 0xfe (second byte)

    If yes, moves the associated process code value to the pwc buffer and returns 2 to indicate the number of bytes converted; otherwise, returns 0 to indicate that no conversion took place and sets err to 2 to specify that at least 2 bytes are needed to form a valid character. [Return to example]

  15. Determines if the character is one of the triple-byte characters that the codeset defines for the value 0x8f (first byte), the range 0xa1 to 0xfe (second byte), and the range 0xa1 to 0xfe (third byte)

    If yes, moves the associated process code value to pwc and returns 3 to indicate the number of bytes converted; otherwise, sets err to 3 to indicate that at least 3 bytes are needed and returns zero (0) to indicate that no character was converted. [Return to example]

  16. Determines if the character is one of the single-byte characters that the codeset defines for the range 0x90 to 0x9f

    If maxlen is less than 1, returns zero (0) to indicate that no bytes were converted and sets err to 1 to indicate that at least 1 byte is needed to form a valid character.

    If the byte value is in the defined range, moves the associated process code value to pwc and returns 1 to indicate the number of bytes converted. [Return to example]

  17. Determines if the character is one of the double-byte characters that the codeset defines for the range 0xa1 to 0xfe (first byte) and either the range 0xa1 to 0xfe or the range 0x21 to 0x7e (second byte)

    If yes, moves the associated process code value to pwc and returns 2 to indicate the number of bytes converted; otherwise, sets err to 2 to indicate that at least 2 bytes are needed to form a valid character and returns zero (0) to indicate that no bytes were converted. [Return to example]

  18. Sets err to -1 to indicate that an invalid multibyte sequence was encountered and returns zero (0) to indicate that no bytes were converted

    These statements execute if the multibyte data in s satisfies none of the preceding if conditions. [Return to example]
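The byte ranges and offsets in the comment of Example 7-11 can be checked with a compact, self-contained function. The following is a sketch for experimentation only: the name sdeckanji_to_pc is hypothetical, the function reports every failure by returning zero (0) instead of using the err parameter, and it does not depend on sys/localedef.h.

```c
#include <stddef.h>

/* True if byte b is in the range 0xa1 to 0xfe. */
static int in_a1_fe(unsigned char b)
{
    return b >= 0xa1 && b <= 0xfe;
}

/*
 * Hypothetical helper: decodes one character at s (at most maxlen bytes)
 * into a process code, using the offsets from the comment in Example 7-11.
 * Returns the number of bytes used (1 to 3), or 0 if the bytes are
 * invalid or too few.
 */
size_t sdeckanji_to_pc(const unsigned char *s, size_t maxlen,
                       unsigned long *pc)
{
    if (s == NULL || pc == NULL || maxlen < 1)
        return 0;

    if (s[0] == 0x8e) {                     /* 2 bytes: JIS X0201 RH */
        if (maxlen >= 2 && in_a1_fe(s[1])) {
            *pc = s[1] + 0x5fUL;
            return 2;
        }
        return 0;
    }
    if (s[0] == 0x8f) {                     /* 3 bytes: JIS X0212 */
        if (maxlen >= 3 && in_a1_fe(s[1]) && in_a1_fe(s[2])) {
            *pc = (((unsigned long)(s[1] - 0xa1) << 7) |
                   (unsigned long)(s[2] - 0xa1)) + 0x303cUL;
            return 3;
        }
        return 0;
    }
    if (s[0] <= 0x9f) {                     /* 1 byte */
        *pc = s[0];
        return 1;
    }
    if (in_a1_fe(s[0]) && maxlen >= 2) {
        unsigned long row = (unsigned long)(s[0] - 0xa1) << 7;

        if (in_a1_fe(s[1])) {               /* 2 bytes: JIS X0208 */
            *pc = (row | (unsigned long)(s[1] - 0xa1)) + 0x15eUL;
            return 2;
        }
        if (s[1] >= 0x21 && s[1] <= 0x7e) { /* 2 bytes: UDC */
            *pc = (row | (unsigned long)(s[1] - 0x21)) + 0x5f1aUL;
            return 2;
        }
    }
    return 0;                               /* invalid sequence */
}
```

Feeding the function the first and last byte sequences of each character set reproduces the process code boundaries listed in the table in the comment, for example 0x015e for the bytes 0xa1 0xa1 and 0x303b for the bytes 0xfe 0xfe.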

7.3.1.3    Writing the _ _pcstombs Method for the fputws( ) Function

The fputws( ) function first calls the _ _pcstombs method to convert a string of characters from process (wide-character) code to multibyte code. If this method returns -1 to indicate no support by the locale, fputws( ) then calls putwc( ) for each wide character in the string being converted. By convention, a C source file for this method has the file name _ _pcstombs_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-12 shows the file _ _pcstombs_sdeckanji.c that defines the _ _pcstombs method used with the ja_JP.sdeckanji locale.

Example 7-12:  The _ _pcstombs_sdeckanji Method for the ja_JP.sdeckanji Locale

int _ _pcstombs_sdeckanji()
{
        return -1;   [1]
}

  1. Returns -1 to indicate that the locale does not support the method.

    This return causes the fputws( ) function to use multiple calls to putwc( ) to convert wide characters in the string. [Return to example]

If you choose to implement this method fully rather than writing it to return -1, your function implementation returns the number of wide characters converted and must include header files and parameters as shown in the following example:

#include <stdlib.h>
#include <wchar.h>
#include <sys/localedef.h>
 
int _ _pcstombs_newcodeset(
        wchar_t *pcsbuf,   [1]
        size_t pcsbuf_len,   [2]
        char *mbsbuf,  [3]
        size_t mbsbuf_len,  [4]
        char **endptr,  [5]
        int *err,   [6]
        _LC_charmap_t *handle )  [7]

  1. Specifies a pointer to a buffer that contains the wide-character string [Return to example]

  2. Specifies a variable with the length of the wide-character buffer

    This value is passed to the method on the call from fputws( ). [Return to example]

  3. Specifies a pointer to a buffer that contains the multibyte-character string [Return to example]

  4. Specifies a variable with the length of the multibyte-character buffer

    This value is passed to the method on the call from fputws( ). [Return to example]

  5. Points, through endptr, to a pointer to the byte position in the multibyte-character buffer where the next character would begin if multiple calls to fputws( ) are required to convert all the wide-character data [Return to example]

  6. Specifies a pointer to the execution status return

    If this method calls the wctomb method to perform the character conversion, the wctomb method sets this status. Otherwise, this method must incorporate the logic to perform wide-character to multibyte-character conversion and set the status directly.

    In any event, the fputws( ) function expects the following values:

    [Return to example]

  7. Specifies a pointer to the _LC_charmap_t structure that stores pointers to the methods used with this locale [Return to example]
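The shape of a full implementation can be sketched in portable C. The following is only an illustration under stated assumptions: it uses the standard wcrtomb( ) function in place of a locale-specific wctomb method, it omits the handle parameter, the name wide_to_bytes is hypothetical, and the status values it stores in err (0, 1, and -1) are placeholders rather than the values that fputws( ) expects.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not a Tru64 UNIX interface): converts up to
 * src_len wide characters from src into at most dst_len bytes in dst.
 * Sets *endptr to the byte after the last byte stored and returns the
 * number of wide characters converted.
 * Placeholder status in *err: 0 = success, 1 = dst is full,
 * -1 = a wide character has no multibyte representation.
 */
size_t wide_to_bytes(const wchar_t *src, size_t src_len,
                     char *dst, size_t dst_len,
                     char **endptr, int *err)
{
    mbstate_t state;
    char tmp[MB_LEN_MAX];       /* room for one multibyte character */
    size_t src_cnt = 0;
    size_t dst_cnt = 0;

    memset(&state, 0, sizeof state);
    *err = 0;

    while (src_cnt < src_len) {
        size_t n = wcrtomb(tmp, src[src_cnt], &state);

        if (n == (size_t)-1) {
            *err = -1;          /* character cannot be converted */
            break;
        }
        if (n > dst_len - dst_cnt) {
            *err = 1;           /* not enough room for this character */
            break;
        }
        memcpy(&dst[dst_cnt], tmp, n);
        dst_cnt += n;
        src_cnt++;
    }
    *endptr = &dst[dst_cnt];
    return src_cnt;
}
```

Converting each character into a temporary buffer first means that a character is never split across the end of the output buffer, so the caller can flush the buffer and resume with the next unconverted wide character.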

The _ _pcstombs method performs the reverse of the operation that the _ _mbstopcs method described in Section 7.3.1.1 performs. Because of the direction of the data conversion, the _ _pcstombs method:

7.3.1.4    Writing a _ _pctomb Method

C Library functions currently do not use the _ _pctomb interface. The putwc( ) function, for example, calls the wctomb method to convert a character from wide-character to multibyte-character format. Nonetheless, the localedef command requires a method for this function when your locale supplies methods. By convention, a C source file for this method has the file name _ _pctomb_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-13 shows the _ _pctomb_sdeckanji.c file that defines the _ _pctomb method used with the ja_JP.sdeckanji locale.

Example 7-13:  The _ _pctomb_sdeckanji Method for the ja_JP.sdeckanji Locale

int _ _pctomb_sdeckanji()
{
        return -1;   [1]
}

  1. Returns -1 to indicate that the locale does not support this method [Return to example]

7.3.1.5    Writing a Method for the mblen( ) Function

The mblen( ) function uses the mblen method to return the number of bytes in a multibyte character. By convention, a C source file for this method has the file name _ _mblen_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-14 shows the _ _mblen_sdeckanji.c file that defines the mblen method used with the ja_JP.sdeckanji locale.

Example 7-14:  The _ _mblen_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>   [1]
#include <wchar.h>   
#include <sys/errno.h>   
#include <sys/localedef.h>   
 
/*
The algorithm for this conversion is:
 
s[0] < 0x9f:  1 byte
s[0] = 0x8e:  2 bytes
s[0] = 0x8f   3 bytes
s[0] > 0xa1   2 bytes
 
+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   [2]
 
int _ _mblen_sdeckanji(
        char *fs,   [3]
        size_t maxlen,   [4]
        _LC_charmap_t *handle )   [5]
{
    const unsigned char *s = (void *) fs;   [6]
 
    if (s == NULL || *s == '\0')
        return(0);   [7]
 
    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    }   [8]
 
    if (s[0] <= 0x8d)
        return(1);   [9]
 
    else if (s[0] == 0x8e) {
        if (maxlen >= 2 && s[1] >=0xa1 && s[1] <=0xfe)
            return(2);
    }   [10]
 
    else if (s[0] == 0x8f) {
        if(maxlen >=3 && (s[1] >=0xa1 && s[1] <=0xfe) &&
            (s[2] >=0xa1 && s[2] <= 0xfe))
            return(3);
    }   [11]
 
    else if (s[0] <= 0x9f)
        return(1);   [12]
 
    else if (s[0] >= 0xa1) {
            if (maxlen >=2 && (s[0] <= 0xfe) )
                    if ( (s[1] >=0xa1 && s[1] <= 0xfe) ||
                       (s[1] >=0x21 && s[1] <= 0x7e) )
                        return(2);
    }   [13]
 
    _Seterrno(EILSEQ);
    return((size_t)-1);   [14]
}

  1. Includes header files that contain constants and structures required by this method [Return to example]

  2. Describes the algorithm used to determine the number of bytes in the character and whether it is a valid byte sequence

    The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid. [Return to example]

  3. Points, through fs, to a buffer that stores the byte string to be examined [Return to example]

  4. Defines a variable, maxlen, that stores the maximum length of a multibyte character

    This value is passed to the method by the mblen( ) function. [Return to example]

  5. Points, through handle, to a structure that stores pointers to the methods that parse character maps for this locale [Return to example]

  6. Casts fs (an array of signed characters) to s (an array of unsigned characters).

    This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension to values when comparing a small signed data type, such as char, to a large signed data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in bits that remain when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e. In this case, a condition like the following is evaluated as true when you expect it to be false:

    if (s[0] <= 0x8d) [Return to example]

  7. Returns zero (0) to indicate that the character length is zero (0) bytes if s is a null pointer or points to a null character [Return to example]

  8. Returns -1 and sets errno to [EILSEQ] (invalid character sequence) if maxlen (the maximum number of bytes to consider) is less than 1

    To set errno in a way that works correctly with multithreaded applications, use _Seterrno rather than an assignment statement. [Return to example]

  9. Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x8d

    If yes, returns 1 to indicate that the character length is 1 byte. [Return to example]

  10. Determines if the first byte identifies a double-byte character whose first byte contains the value 0x8e and second byte contains a value in the range 0xa1 to 0xfe

    If yes, returns 2 to indicate that the character length is 2 bytes. [Return to example]

  11. Determines if the first byte identifies a triple-byte character whose first byte contains the value 0x8f and whose second and third bytes contain a value in the range 0xa1 to 0xfe

    If yes, returns 3 to indicate that the character length is 3 bytes. [Return to example]

  12. Determines if the first byte identifies a single-byte character whose value is in the range 0x90 to 0x9f

    If yes, returns 1 to indicate that the character length is 1 byte. [Return to example]

  13. Determines if the first byte identifies a double-byte character whose first byte contains a value in the range 0xa1 to 0xfe and whose second byte contains a value in either the range 0xa1 to 0xfe or the range 0x21 to 0x7e

    If yes, returns 2 to indicate that the character length is 2 bytes. [Return to example]

  14. Returns -1 and sets errno to [EILSEQ] to indicate an invalid multibyte sequence

    These statements execute if the multibyte data in s satisfies none of the preceding if conditions. [Return to example]
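The sign-extension problem that the cast to unsigned char avoids can be demonstrated in a few lines. This is a minimal sketch with hypothetical function names; it uses signed char explicitly because whether plain char is signed depends on the compiler and platform.

```c
/*
 * Compares a byte held in a signed char with 0x8d. The byte is promoted
 * to int before the comparison, so a value such as 0x8e becomes negative
 * and the test is true. Returns 1 if the comparison is true, else 0.
 */
int at_most_0x8d_signed(signed char byte)
{
    return byte <= 0x8d;
}

/*
 * Converts the byte to unsigned char first, so 0x8e is compared as 142
 * and the test is false. Returns 1 if the comparison is true, else 0.
 */
int at_most_0x8d_unsigned(signed char byte)
{
    return (unsigned char)byte <= 0x8d;
}
```

On a typical two's-complement system, passing the byte 0x8e gives 1 from the first function and 0 from the second. The second result is the one that the range tests in the methods rely on.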

7.3.1.6    Writing a Method for the mbstowcs( ) Function

The mbstowcs( ) function uses the mbstowcs method to convert a multibyte-character string to process (wide-character) code and to return the number of resultant wide characters. By convention, a C source file for this method has the file name _ _mbstowcs_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-15 shows the _ _mbstowcs_sdeckanji.c file that defines the mbstowcs method used with the ja_JP.sdeckanji locale.

Example 7-15:  The _ _mbstowcs_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>   [1]
#include <wchar.h>   
#include <sys/localedef.h>  
 
size_t _ _mbstowcs_sdeckanji(
        wchar_t *pwcs,   [2]
        const char *s,   [3]
        size_t n,   [4]
        _LC_charmap_t *handle )   [5]
{
    int len = n;   [6]
    int rc;   [7]
    int cnt;   [8]
    wchar_t *pwcs0 = pwcs;   [9]
    int mb_cur_max;   [10]
 
    if (s == NULL)
        return (0);   [11]
 
    mb_cur_max = MB_CUR_MAX;   [12]
 
    if (pwcs == (wchar_t *)NULL) {
        cnt = 0;
        while (*s != '\0') {
             if ((rc = _ _mblen_sdeckanji(s, mb_cur_max, handle)) == -1)
                return(-1);
             cnt++  ;
             s += rc;
        }
        return(cnt);
    }   [13]
 
    while (len-- > 0) {
        if ( *s == '\0') {
            *pwcs = (wchar_t) '\0';
            return (pwcs - pwcs0);
        }
        if ((cnt = _ _mbtowc_sdeckanji(pwcs, s, mb_cur_max, handle)) < 0)
            return(-1);
        s += cnt;
        ++pwcs;
    }   [14]
 
    return (n);   [15]
}

  1. Includes header files that contain constants and structures required for this method [Return to example]

  2. Points, through pwcs, to a buffer that contains the wide-character string [Return to example]

  3. Points, through s, to a buffer that contains the multibyte-character string [Return to example]

  4. Defines a variable, n, that contains the maximum number of wide characters that can be stored in pwcs [Return to example]

  5. Points, through handle, to a structure that stores pointers to the methods that parse character maps for this locale [Return to example]

  6. Assigns the maximum number of wide characters that the pwcs buffer can hold (the n value supplied by the calling function) to len [Return to example]

  7. Defines a variable, rc, that stores the return count from a call this method makes to the mblen function [Return to example]

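  8. Defines a variable, cnt, that stores either the count of characters in the s buffer (when only the string length is requested) or the number of bytes returned by a call to the mbtowc method [Return to example]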

  9. Saves the start of the wide-character string passed by the calling function in the pwcs0 variable [Return to example]

  10. Defines a variable, mb_cur_max, that is later set to MB_CUR_MAX and used in a call to the mblen method [Return to example]

  11. Returns zero (0) if s is null

    A method should return zero (0) if the locale's character encoding is stateless and a nonzero value if the locale's character encoding is stateful. [Return to example]

  12. Assigns the value defined for MB_CUR_MAX to mb_cur_max for use on the following call to the mblen method [Return to example]

  13. Checks to see if a null pointer was passed from the calling function and, if yes, calls the mblen method to calculate the size of the wide-character string

    The programmer can request the size of the pwcs buffer (for memory allocation purposes) by passing a null pointer as the pwcs parameter in the call to mbstowcs( ). The programmer can then use the return value to efficiently allocate memory space for the application's wide-character buffer before calling mbstowcs( ) again to actually convert the multibyte string. [Return to example]

  14. Converts bytes in the multibyte-character buffer by calling the _ _mbtowc method until a null character (end-of-string) is encountered

    Stops processing and returns the number of wide characters in the pwcs buffer if a null character is encountered; increments the byte position in the multibyte-character buffer by the appropriate number each time a character is successfully converted.

    This while loop uses the condition len-- > 0 to ensure that processing stops when the pwcs buffer is full. The first if condition in the loop makes sure that, if the multibyte string in the s buffer is null terminated, the associated null terminator in the pwcs buffer is not included in the wide-character count that the mbstowcs( ) function returns to the application. [Return to example]

  15. Returns the value in n to indicate the resultant number of wide characters in the pwcs buffer

    This statement executes if the pwcs buffer runs out of space before a NULL is encountered in the s buffer. [Return to example]
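
The sizing call described in item 13 can be exercised from application code. The following is a minimal sketch, assuming a system whose mbstowcs( ) accepts a null pointer for pwcs as described above (ISO C does not itself define this case). The name to_wide_string is a hypothetical helper used only for illustration:

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Converts the multibyte string s to a newly allocated wide-character
 * string. Returns NULL on an invalid sequence or allocation failure.
 * to_wide_string is a hypothetical helper, not a library function.
 */
wchar_t *to_wide_string(const char *s, size_t *count)
{
    size_t needed;
    wchar_t *buf;

    /* First call: a null pwcs pointer requests only the character count. */
    needed = mbstowcs(NULL, s, 0);
    if (needed == (size_t)-1)
        return NULL;

    /* Allocate room for the characters plus the null terminator. */
    buf = malloc((needed + 1) * sizeof(wchar_t));
    if (buf == NULL)
        return NULL;

    /* Second call: performs the actual conversion. */
    if (mbstowcs(buf, s, needed + 1) == (size_t)-1) {
        free(buf);
        return NULL;
    }

    if (count != NULL)
        *count = needed;
    return buf;
}
```

The caller is responsible for freeing the returned buffer. The count stored through the count parameter excludes the null terminator, which matches the value that the method returns.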

7.3.1.7    Writing a Method for the mbtowc( ) Function

The mbtowc( ) function uses the mbtowc method to convert a multibyte character to a wide character and to return the number of bytes in the multibyte character that was converted. By convention, a C source file for this method has the file name _ _mbtowc_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-16 shows the _ _mbtowc_sdeckanji.c file that defines the mbtowc method used with the ja_JP.sdeckanji locale.

Example 7-16:  The _ _mbtowc_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>   [1]
#include <wchar.h>   
#include <sys/errno.h>   
#include <sys/localedef.h>   
 
/*
The algorithm for this conversion is:
 
s[0] <= 0x9f: PC = s[0]
s[0] = 0x8e:  PC = s[1] + 0x5f
s[0] = 0x8f:  PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
s[0] >= 0xa1: 0xa1 <= s[1] <= 0xfe
              PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
              0x21 <= s[1] <= 0x7e
              PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a
 
+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   [2]
int _ _mbtowc_sdeckanji(
        wchar_t *pwc,   [3]
        const char *ts,   [4]
        size_t maxlen,   [5]
        _LC_charmap_t *handle )   [6]
{
    unsigned char *s = (unsigned char *)ts;   [7]
    wchar_t dummy;   [8]
 
    if (s == NULL)
        return(0);   [9]
 
    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    }   [10]
 
    if (pwc == (wchar_t *)NULL)
        pwc = &dummy;   [11]
 
    if (s[0] <= 0x8d) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    }   [12]
 
    else if (s[0] == 0x8e) {
        if ( (maxlen >= 2) && ((s[1] >=0xa1) && (s[1] <=0xfe))) {
            *pwc = (wchar_t) (s[1] + 0x5f); /* 0x100 - 0xa1 */
            return(2);
        }
    }   [13]
 
    else if (s[0] == 0x8f) {
        if((maxlen >= 3) && (((s[1] >=0xa1) && (s[1] <=0xfe))
           && ((s[2] >=0xa1) && (s[2] <= 0xfe)))) {
                *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                   (wchar_t) (s[2] - 0xa1)) + 0x303c;
           return(3);
        }
    }   [14]
 
    else if (s[0] <= 0x9f) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    }   [15]
 
    else if (((s[0] >= 0xa1) && (s[0] <= 0xfe)) && (maxlen >= 2)){
            if (((s[1] >=0xa1) && (s[1] <= 0xfe))){
                    *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                              (wchar_t)(s[1] - 0xa1)) + 0x15e;
                    return(2);
            } else if (((s[1] >=0x21) && (s[1] <= 0x7e))){
                    *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                              (wchar_t)(s[1] - 0x21)) + 0x5f1a;
                    return(2);
            }
    }   [16]
    _Seterrno(EILSEQ);
    return(-1);   [17]
}

  1. Includes header files that contain constants and structures required for this method [Return to example]

  2. Describes the algorithm used to determine the number of bytes in the character and whether it is a valid byte sequence

    The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid. [Return to example]

  3. Points, through pwc, to a buffer that contains the wide character [Return to example]

  4. Points, through ts, to a buffer that contains values in multibyte-character format [Return to example]

  5. Defines a variable, maxlen, that stores the maximum length of a multibyte character

    This value is passed from the calling function; the value will have been set to MB_CUR_MAX on the original call made by the application programmer. [Return to example]

  6. Points, through handle, to a structure that stores pointers to the methods that parse character maps for this locale [Return to example]

  7. Casts ts (an array of signed characters) to s (an array of unsigned characters)

    This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension to values when comparing a small signed data type, such as char, to a large signed data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in bits that remain when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e. In this case, a condition like the following one would be evaluated as true when you would expect it to be false:

    if (s[0] <= 0x8d) [Return to example]

  8. Defines a variable, dummy, that can be assigned to pwc to ensure pwc points to a valid address [Return to example]

  9. Returns zero (0) to indicate that the locale's character encoding is stateless if s is a null pointer

    If passed a null pointer, this method should return a value to indicate whether the locale's character encoding is stateful or stateless. Return a nonzero value if your locale's character encoding is stateful. [Return to example]

  10. Returns -1 cast to size_t and sets errno to [EILSEQ] (invalid byte sequence) if the multibyte data buffer is less than 1 byte in length [Return to example]

  11. Points pwc to the dummy variable if the calling function passed a null pointer for pwc

    This operation ensures that pwc always points to a valid address; otherwise, an application could produce a segmentation fault by referring to this pointer when a wide character has not been stored in pwc. [Return to example]

  12. Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x8d

    If yes, stores the associated process code value in the pwc buffer and returns 1 to indicate that the character length is 1 byte [Return to example]

  13. Determines if the first byte identifies a double-byte character whose first byte contains the value 0x8e and second byte contains a value in the range 0xa1 to 0xfe

    If yes, stores the associated process code value in the pwc buffer and returns 2 to indicate that the character length is 2 bytes [Return to example]

  14. Determines if the first byte identifies a triple-byte character whose first byte contains the value 0x8f and whose second and third bytes contain a value in the range 0xa1 to 0xfe

    If yes, stores the associated process code value in the pwc buffer and returns 3 to indicate that the character length is 3 bytes [Return to example]

  15. Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x9f

    If yes, stores the associated process code value in the pwc buffer and returns 1 to indicate that the character length is 1 byte [Return to example]

  16. Determines if the first byte identifies a double-byte character whose first byte contains a value in the range 0xa1 to 0xfe and whose second byte contains a value in either the range 0xa1 to 0xfe or the range 0x21 to 0x7e

    If yes, stores the associated process code value in the pwc buffer and returns 2 to indicate that the character length is 2 bytes [Return to example]

  17. Returns -1 and sets errno to [EILSEQ] to indicate that an invalid multibyte sequence was encountered

    These statements execute if the multibyte data in the s buffer satisfies none of the preceding if conditions. [Return to example]
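
The sign-extension problem described in item 7 can be reproduced in isolation. The following sketch is not part of the method. The function names are hypothetical, and signed char is used explicitly because the signedness of plain char varies among compilers:

```c
/*
 * Compares the first byte as a signed value. A byte such as 0x8e is
 * promoted to a negative int, so the comparison succeeds unexpectedly.
 */
int is_low_byte_signed(const char *ts)
{
    signed char c = (signed char)ts[0];

    return c <= 0x8d;
}

/*
 * Compares the first byte as an unsigned value, which is what the
 * cast from ts to s in Example 7-16 achieves.
 */
int is_low_byte_unsigned(const char *ts)
{
    const unsigned char *s = (const unsigned char *)ts;

    return s[0] <= 0x8d;
}
```

For a buffer whose first byte is 0x8e, the signed comparison reports that the byte is less than or equal to 0x8d, while the unsigned comparison reports that it is not.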

7.3.1.8    Writing a Method for the wcstombs( ) Function

The wcstombs( ) function calls the wcstombs method to convert a wide-character string to a multibyte-character string and to return the number of bytes in the resultant multibyte-character string. By convention, a C source file for this method has the file name _ _wcstombs_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-17 shows the _ _wcstombs_sdeckanji.c file that defines the wcstombs method used with the ja_JP.sdeckanji locale.

Example 7-17:  The _ _wcstombs_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>   [1]
#include <wchar.h>   
#include <limits.h>   
#include <sys/localedef.h>   
 
size_t _ _wcstombs_sdeckanji(
        char *s,   [2]
        const wchar_t *pwcs,   [3]
        size_t n,   [4]
        _LC_charmap_t *handle )   [5]
{
    int cnt=0;   [6]
    int len=0;   [7]
    int i=0;   [8]
    char tmps[MB_LEN_MAX+1];   [9]
 
    if ( s == (char *)NULL) {
        cnt = 0;
        while (*pwcs != (wchar_t)'\0') {
            if ((len = _ _wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
                    return(-1);
            cnt += len;
            pwcs++;
        }
        return(cnt);
    }   [10]
 
    if (*pwcs == (wchar_t)'\0') {
        *s = '\0';
        return(0);
    }   [11]
 
    while (1) {   [12]
 
        if ((len = _ _wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
            return(-1);   [13]
 
        else if (cnt+len > n) {
            *s = '\0';
            break;
        }   [14]
 
        if (tmps[0] == '\0') {
            *s = '\0';
            break;
        }   [15]
 
        for (i=0; i<len; i++) {
            *s = tmps[i];
            s++;
        }   [16]
 
        cnt += len;   [17]
 
        if (cnt == n)
            break;   [18]
 
        pwcs++;   [19]
    }   [20]
 
    if (cnt == 0)
        cnt = len;   [21]
    return (cnt);   [22]
}

  1. Includes header files that contain constants and structures required for this method [Return to example]

  2. Points, through s, to a buffer that stores the multibyte-character string that this method passes to the calling function [Return to example]

  3. Points, through pwcs, to a buffer that stores the wide-character string that is being converted [Return to example]

  4. Defines a variable, n, that stores the maximum number of bytes in the multibyte-character string buffer

    This value is supplied by the calling function. [Return to example]

  5. Points, through handle, to a structure that points to the methods that parse character maps for this locale [Return to example]

  6. Initializes a variable, cnt, that is incremented by the number of bytes (len) of each converted character [Return to example]

  7. Initializes a variable, len, that stores the length of each converted character [Return to example]

  8. Initializes a variable, i, that is used to index the bytes in each multibyte character when moving a converted character from temporary storage to s [Return to example]

  9. Defines a temporary buffer, tmps, that stores the multibyte character returned to this method from a call to the wctomb method [Return to example]

  10. Checks to see if a null pointer was passed from the calling function as the s parameter

    If yes, calls the wctomb method to calculate the number of bytes required for converted characters (excluding the null terminator) in the multibyte-character buffer

    The programmer can request the size of the s buffer (for memory allocation purposes) by passing a null pointer as the s parameter on the call to wcstombs( ). The programmer can then use the return value to efficiently allocate memory space for the application's multibyte-character buffer before calling wcstombs( ) again to actually convert the wide-character string. [Return to example]

  11. Stores a null terminator in s and returns zero (0) to indicate that no multibyte characters resulted if pwcs points to a null wide character [Return to example]

  12. Starts a while loop to process characters in the wide-character string [Return to example]

  13. Converts characters in the wide-character buffer by calling the wctomb method; returns -1 to indicate an invalid character if wctomb returns -1 [Return to example]

  14. Terminates s with NULL and breaks out of the while loop if there is no room in s for the character just converted by wctomb [Return to example]

  15. Moves a null terminator to s and breaks out of the while loop when a null wide character is encountered in pwcs [Return to example]

  16. Appends each byte in tmps to s if the current wide character is not a null [Return to example]

  17. Increments cnt by the number of bytes (len) occupied by this character in multibyte format [Return to example]

  18. Breaks out of the while loop without adding a null terminator if the number of bytes processed equals n (the maximum number of bytes in s) [Return to example]

  19. Increments pwcs to point to the next wide character to be converted [Return to example]

  20. Ends the while loop that converts each wide character [Return to example]

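  21. Sets cnt to len, the length of the character most recently converted by the wctomb method, if no bytes were stored in s; this condition occurs when s does not contain enough space for even one character [Return to example]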

  22. Returns the number of bytes in the resultant multibyte-character string [Return to example]
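
The two-call pattern described in item 10 looks as follows from the application's side. This is a minimal sketch, assuming a system whose wcstombs( ) accepts a null pointer for s as described above. The name to_multibyte_string is a hypothetical helper used only for illustration:

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Converts the wide-character string pwcs to a newly allocated
 * multibyte string. Returns NULL if a wide character cannot be
 * converted or if memory cannot be allocated.
 * to_multibyte_string is a hypothetical helper, not a library function.
 */
char *to_multibyte_string(const wchar_t *pwcs, size_t *bytes)
{
    size_t needed;
    char *buf;

    /* First call: a null s pointer requests only the byte count. */
    needed = wcstombs(NULL, pwcs, 0);
    if (needed == (size_t)-1)
        return NULL;

    /* The count excludes the null terminator, so add 1 byte for it. */
    buf = malloc(needed + 1);
    if (buf == NULL)
        return NULL;

    /* Second call: performs the actual conversion. */
    if (wcstombs(buf, pwcs, needed + 1) == (size_t)-1) {
        free(buf);
        return NULL;
    }

    if (bytes != NULL)
        *bytes = needed;
    return buf;
}
```

Because the buffer is allocated with one more byte than the first call reports, the second call always has room to store the null terminator.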

7.3.1.9    Writing a Method for the wctomb( ) Function

The wctomb( ) function calls the wctomb method to convert a wide character to a multibyte character and to return the number of bytes in the resultant multibyte character. By convention, a C source file for this method has the file name _ _wctomb_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-18 shows the _ _wctomb_sdeckanji.c file that defines the wctomb method for the ja_JP.sdeckanji locale.

Example 7-18:  The _ _wctomb_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>   [1]
#include <wchar.h>   
#include <sys/errno.h>   
#include <sys/localedef.h>   
 
/*
  The algorithm for this conversion is:
 
PC <= 0x009f:                 s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                              s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                              s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                              s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                              s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7  s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                              s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
 
+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   [2]
 
int _ _wctomb_sdeckanji(
        char *s,    [3]
        wchar_t wc,    [4]
        _LC_charmap_t *handle )    [5]
{
    if (s == (char *)NULL)
        return(0);    [6]
 
    if (wc <= 0x9f) {
        s[0] = (char) wc;
        return(1);
    }    [7]
 
    else if ((wc >= 0x0100) && (wc <= 0x015d)) {
        s[0] = 0x8e;
        s[1] = wc - 0x5f;
        return(2);
    }    [8]
 
    else if ((wc >=0x015e) && (wc <= 0x303b)) {
        s[0] = (char) (((wc - 0x015e) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x015e) & 0x007f) + 0x00a1);
        return(2);
    }    [9]
 
    else if ((wc >=0x303c) && (wc <= 0x5f19)) {
        s[0] = 0x8f;
        s[1] = (char) (((wc - 0x303c) >> 7) + 0x00a1);
        s[2] = (char) (((wc - 0x303c) & 0x007f) + 0x00a1);
        return(3);
    }    [10]
 
    else if ((wc >=0x5f1a) && (wc <= 0x8df7)) {
        s[0] = (char) (((wc - 0x5f1a) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x5f1a) & 0x007f) + 0x0021);
        return(2);
    }    [11]
 
    _Seterrno(EILSEQ);
    return(-1);    [12]
}

  1. Includes header files that contain constants and structures required for this method [Return to example]

  2. Describes the conversion algorithm that this method uses

    Each character set supported by the codeset corresponds to a unique range of wide-character (process code) values and, within each character set, multibyte characters are of uniform length (1, 2, or 3 bytes). Therefore, the range in which each wide-character value falls indicates the number of bytes required for the character in multibyte format; the wide-character value itself determines the specific byte value or values for the character in multibyte format. [Return to example]

  3. Points, through s, to a buffer that stores the multibyte character [Return to example]

  4. Defines the wc variable that stores the wide character [Return to example]

  5. Points, through handle, to a structure that stores pointers to the methods that parse the character maps for this locale [Return to example]

  6. Returns zero (0) to indicate that the locale's character encoding is stateless if s is a null pointer [Return to example]

  7. If the wide-character value is equal to or less than 0x9f, moves that value into the first byte of the s array and returns 1 to indicate that the converted character is 1 byte in length [Return to example]

  8. If the wide-character value is in the range 0x0100 to 0x015d, moves the value 0x8e to the first byte and a calculated value to the second byte of the s array; returns 2 to indicate that the converted character is 2 bytes in length [Return to example]

  9. If the wide-character value is in the range 0x015e to 0x303b, moves calculated values to the first and second bytes of the s array and returns 2 to indicate that the converted character is 2 bytes in length [Return to example]

  10. If the wide-character value is in the range 0x303c to 0x5f19, moves 0x8f to the first byte and calculated values to the second and third bytes of the s array; returns 3 to indicate that the converted character is 3 bytes in length [Return to example]

  11. If the wide-character value is in the range 0x5f1a to 0x8df7, moves calculated values to the first and second bytes of the s array, and returns 2 to indicate that the converted character is 2 bytes in length [Return to example]

  12. Sets errno to [EILSEQ] and returns -1 to indicate that the wide-character value is invalid

    These statements execute if the wide-character values satisfy none of the preceding conditions. [Return to example]
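
Because the wctomb and mbtowc methods must be exact inverses of each other, a round-trip check is a useful way to validate the arithmetic. The following sketch restates the formulas from Example 7-16 and Example 7-18 in standalone form. The names pc_encode, pc_decode, and pc_round_trips are hypothetical, process codes are held in unsigned long rather than wchar_t, and, unlike the mbtowc method, the decoder returns 1 rather than 0 for the null character:

```c
#include <stddef.h>

static int in_range(unsigned int b, unsigned int lo, unsigned int hi)
{
    return b >= lo && b <= hi;
}

/* Encodes process code pc into s (at least 3 bytes).
 * Returns the number of bytes stored, or -1 if pc is invalid. */
int pc_encode(unsigned long pc, unsigned char *s)
{
    unsigned long off;

    if (pc <= 0x9f) {
        s[0] = (unsigned char)pc;
        return 1;
    }
    if (pc >= 0x0100 && pc <= 0x015d) {            /* JIS X0201 RH */
        s[0] = 0x8e;
        s[1] = (unsigned char)(pc - 0x5f);
        return 2;
    }
    if (pc >= 0x015e && pc <= 0x303b) {            /* JIS X0208 */
        off = pc - 0x015e;
        s[0] = (unsigned char)((off >> 7) + 0xa1);
        s[1] = (unsigned char)((off & 0x7f) + 0xa1);
        return 2;
    }
    if (pc >= 0x303c && pc <= 0x5f19) {            /* JIS X0212 */
        off = pc - 0x303c;
        s[0] = 0x8f;
        s[1] = (unsigned char)((off >> 7) + 0xa1);
        s[2] = (unsigned char)((off & 0x7f) + 0xa1);
        return 3;
    }
    if (pc >= 0x5f1a && pc <= 0x8df7) {            /* UDC */
        off = pc - 0x5f1a;
        s[0] = (unsigned char)((off >> 7) + 0xa1);
        s[1] = (unsigned char)((off & 0x7f) + 0x21);
        return 2;
    }
    return -1;
}

/* Decodes one character from s into *pc.
 * Returns the number of bytes consumed, or -1 if the bytes are invalid. */
int pc_decode(const unsigned char *s, size_t maxlen, unsigned long *pc)
{
    if (maxlen < 1)
        return -1;
    if (s[0] == 0x8e) {
        if (maxlen >= 2 && in_range(s[1], 0xa1, 0xfe)) {
            *pc = s[1] + 0x5fUL;
            return 2;
        }
        return -1;
    }
    if (s[0] == 0x8f) {
        if (maxlen >= 3 && in_range(s[1], 0xa1, 0xfe)
            && in_range(s[2], 0xa1, 0xfe)) {
            *pc = (((unsigned long)(s[1] - 0xa1) << 7)
                   | (unsigned long)(s[2] - 0xa1)) + 0x303c;
            return 3;
        }
        return -1;
    }
    if (s[0] <= 0x9f) {
        *pc = s[0];
        return 1;
    }
    if (in_range(s[0], 0xa1, 0xfe) && maxlen >= 2) {
        unsigned long lead = (unsigned long)(s[0] - 0xa1) << 7;

        if (in_range(s[1], 0xa1, 0xfe)) {
            *pc = (lead | (unsigned long)(s[1] - 0xa1)) + 0x15e;
            return 2;
        }
        if (in_range(s[1], 0x21, 0x7e)) {
            *pc = (lead | (unsigned long)(s[1] - 0x21)) + 0x5f1a;
            return 2;
        }
    }
    return -1;
}

/* Returns 1 if pc encodes and then decodes back to the same value. */
int pc_round_trips(unsigned long pc)
{
    unsigned char buf[3];
    unsigned long back = 0;
    int len = pc_encode(pc, buf);

    if (len < 0)
        return 0;
    return pc_decode(buf, (size_t)len, &back) == len && back == pc;
}
```

In this sketch, a process code whose offset within its range has a low-order 7-bit value greater than 93 (for example, 0x01bc) encodes to a second byte outside the valid range and therefore does not round-trip. The range checks alone do not reject such values.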

7.3.1.10    Writing a Method for the wcswidth( ) Function

The wcswidth( ) function uses the wcswidth method to determine the number of columns required to display a wide-character string. By convention, a C source file for this method has the file name _ _wcswidth_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-19 shows the _ _wcswidth_sdeckanji.c file that defines the wcswidth method used for the ja_JP.sdeckanji locale.

Example 7-19:  The _ _wcswidth_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>   [1]
#include <wchar.h>   
#include <sys/localedef.h>   
 
/*
The algorithm for this conversion is:
 
PC <= 0x009f:                 s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                              s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                              s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                              s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                              s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7  s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                              s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   [2]
 
int _ _wcswidth_sdeckanji(
        const wchar_t *wcs,   [3]
        size_t n,   [4]
        _LC_charmap_t *hdl )   [5]
{
    int len;   [6]
    int i;   [7]
 
    if (wcs == (wchar_t *)NULL || *wcs == (wchar_t)NULL)
        return(0);   [8]
 
    len = 0;   [9]
    for (i=0; i<n && wcs[i] != (wchar_t)NULL; i++) {   [10]
 
        if (wcs[i] <= 0x9f)
             len += 1;   [11]
 
        else if ((wcs[i] >= 0x0100) && (wcs[i] <= 0x015d))
             len += 1;   [12]
 
        else if ((wcs[i] >=0x015e) && (wcs[i] <= 0x303b))
             len += 2;   [13]
 
        else if ((wcs[i] >=0x303c) && (wcs[i] <= 0x5f19))
            len += 2;   [14]
 
        else if ((wcs[i] >=0x5f1a) && (wcs[i] <= 0x8df7))
            len += 2;   [15]
 
        else
            return(-1);   [16]
    }   [17]
 
    return(len);   [18]
}

  1. Includes header files that contain constants and structures required for this method [Return to example]

  2. Describes the algorithm used to determine the required display width

    Note that each character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns. [Return to example]

  3. Points, through wcs, to a buffer that stores the wide-character string for which display width information is requested [Return to example]

  4. Defines a variable, n, that stores the maximum size of the wcs buffer [Return to example]

  5. Points, through hdl, to a structure that stores pointers to the methods that parse character maps for this locale [Return to example]

  6. Defines a variable, len, that stores the display width in columns [Return to example]

  7. Defines a variable, i, that functions as a loop counter [Return to example]

  8. Returns zero (0) if wcs is a null pointer or points to a null wide character [Return to example]

  9. Initializes len to zero (0) [Return to example]

  10. Begins a for loop that processes each wide character in the wcs buffer, up to a maximum of n characters, and increments the index i [Return to example]

  11. Increments len by 1 if the value of the current wide character is less than or equal to 0x9f [Return to example]

  12. Increments len by 1 if the value of the current wide character is in the range 0x0100 to 0x015d [Return to example]

  13. Increments len by 2 if the value of the current wide character is in the range 0x015e to 0x303b [Return to example]

  14. Increments len by 2 if the value of the current wide character is in the range 0x303c to 0x5f19 [Return to example]

  15. Increments len by 2 if the value of the current wide character is in the range 0x5f1a to 0x8df7 [Return to example]

  16. Returns -1 to indicate that the string contains an invalid wide character

    This statement executes if a value that satisfies none of the preceding conditions is encountered in the string. The calling function, wcswidth( ), also returns -1 if the wide character is nonprintable; however, this condition is evaluated at the level of the calling function and does not need to be evaluated by the method. [Return to example]

  17. Ends the for loop that processes wide characters in the wcs buffer [Return to example]

  18. Returns len to indicate the number of columns required to display the wide-character string [Return to example]

7.3.1.11    Writing a Method for the wcwidth( ) Function

The wcwidth( ) function uses the wcwidth method to determine the number of columns required to display a wide character. By convention, a C source file for this method has the file name _ _wcwidth_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-20 shows the _ _wcwidth_sdeckanji.c file that defines the wcwidth method used with the ja_JP.sdeckanji locale.

Example 7-20:  The _ _wcwidth_sdeckanji Method for the ja_JP.sdeckanji Locale

#include <stdlib.h>   [1]
#include <wchar.h>   
#include <sys/localedef.h>   
 
/*
The algorithm for this conversion is:
 
PC <= 0x009f:                 s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                              s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                              s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                              s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                              s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7  s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                              s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
 
+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   [2]
 
int _ _wcwidth_sdeckanji(
        wint_t wc,   [3]
        _LC_charmap_t *hdl )   [4]
{
 
    if (wc == 0)
        return(0);   [5]
 
    if (wc <= 0x9f)
        return(1);   [6]
 
    else if ((wc >= 0x0100) && (wc <= 0x015d))
        return(1);   [7]
 
    else if ((wc >=0x015e) && (wc <= 0x303b))
        return(2);   [8]
 
    else if ((wc >=0x303c) && (wc <= 0x5f19))
        return(2);   [9]
 
    else if ((wc >=0x5f1a) && (wc <= 0x8df7))
        return(2);   [10]
 
    return(-1);   [11]
}

  1. Includes header files that contain constants and structures required for this method [Return to example]

  2. Describes the algorithm used to determine the required display width

    Note that a character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns. [Return to example]

  3. Defines the wc variable that stores the wide character for which display width information is requested [Return to example]

  4. Points, through hdl, to a structure that stores pointers to the methods that parse character maps for this locale [Return to example]

  5. Returns zero (0) if the wide character is the null character [Return to example]

  6. Returns 1 if the wide-character value is less than or equal to 0x009f [Return to example]

  7. Returns 1 if the wide-character value is in the range 0x0100 to 0x015d [Return to example]

  8. Returns 2 if the wide-character value is in the range 0x015e to 0x303b [Return to example]

  9. Returns 2 if the wide-character value is in the range 0x303c to 0x5f19 [Return to example]

  10. Returns 2 if the wide-character value is in the range 0x5f1a to 0x8df7 [Return to example]

  11. Returns -1 if the wide-character value is invalid

    The calling function, wcwidth( ), also returns -1 if the wide character is nonprintable; however, this condition is evaluated at the level of the calling function and does not need to be evaluated by the method. [Return to example]
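
The wcwidth and wcswidth methods test the same process code ranges, so one way to keep them consistent is to drive both from a single table. The following is a sketch of that design choice, not the implementation used by the ja_JP.sdeckanji locale. The names width_table, pc_width, and pcs_width are hypothetical, and process codes are held in unsigned long rather than wchar_t:

```c
#include <stddef.h>

/* One entry per process code range, with its display width in columns. */
struct width_range {
    unsigned long first;
    unsigned long last;
    int columns;
};

static const struct width_range width_table[] = {
    { 0x0001, 0x009f, 1 },   /* single-byte characters */
    { 0x0100, 0x015d, 1 },   /* JIS X0201 RH */
    { 0x015e, 0x303b, 2 },   /* JIS X0208 */
    { 0x303c, 0x5f19, 2 },   /* JIS X0212 */
    { 0x5f1a, 0x8df7, 2 },   /* UDC */
};

/* Returns the display width of one process code, 0 for the null
 * character, or -1 if the value falls outside every range. */
int pc_width(unsigned long pc)
{
    size_t i;

    if (pc == 0)
        return 0;
    for (i = 0; i < sizeof width_table / sizeof width_table[0]; i++) {
        if (pc >= width_table[i].first && pc <= width_table[i].last)
            return width_table[i].columns;
    }
    return -1;
}

/* Returns the total display width of at most n process codes, stopping
 * at a null terminator, or -1 if an invalid value is encountered. */
int pcs_width(const unsigned long *pcs, size_t n)
{
    int total = 0;
    int w;
    size_t i;

    if (pcs == NULL)
        return 0;
    for (i = 0; i < n && pcs[i] != 0; i++) {
        w = pc_width(pcs[i]);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

With this arrangement, a change to the width of a character set requires an update in only one place.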

7.3.2    Optional Methods

A locale can include methods in addition to those discussed in Section 7.3.1. If your locale uses methods but does not supply any for the functions associated with particular locale categories or some other locale-related functions, the localedef command applies default methods that handle process code for both single-byte and multibyte characters. The following list names the optional methods:

Writing optional methods requires detailed information about the internal interfaces to C library routines. This information is vendor proprietary and may be subject to change. In the rare cases where your locale must include an optional method, contact your technical support representative to request information.

7.3.3    Building a Shareable Library to Use with a Locale

Example 7-21 shows the compiler and linker command lines that are required to build the method source files into a shareable library that is used with the ja_JP.sdeckanji locale.

Example 7-21:  Building a Library of Methods Used with the ja_JP.sdeckanji Locale

cc -std0 -c \
   _ _mblen_sdeckanji.c _ _mbstopcs_sdeckanji.c \
   _ _mbstowcs_sdeckanji.c _ _mbtopc_sdeckanji.c \
   _ _mbtowc_sdeckanji.c _ _pcstombs_sdeckanji.c \
   _ _pctomb_sdeckanji.c _ _wcstombs_sdeckanji.c \
   _ _wcswidth_sdeckanji.c _ _wctomb_sdeckanji.c \
   _ _wcwidth_sdeckanji.c
 
ld -shared -set_version osf.1 -soname libsdeckanji.so \
   -no_archive -o libsdeckanji.so \
   _ _mblen_sdeckanji.o _ _mbstopcs_sdeckanji.o \
   _ _mbstowcs_sdeckanji.o _ _mbtopc_sdeckanji.o \
   _ _mbtowc_sdeckanji.o _ _pcstombs_sdeckanji.o _ _pctomb_sdeckanji.o \
   _ _wcstombs_sdeckanji.o _ _wcswidth_sdeckanji.o _ _wctomb_sdeckanji.o \
   _ _wcwidth_sdeckanji.o \
   -lc

Refer to cc(1) and ld(1) for more information about the cc and ld commands and how you build shared libraries.

7.3.4    Creating a methods File for a Locale

The methods file contains an entry for each function that is defined in the methods shared library for use with the locale. The operation performed by the function is identified by a method keyword, followed by quoted strings with the name of the function and the path to the shared library that contains the function.

Example 7-22 shows the section of a methods file for the methods used with the ja_JP.sdeckanji locale. Because there is a mandatory list of methods that you must define if you want to override any C library interfaces, your methods file must always specify an entry for each of the required methods as shown in this example. The ja_JP.sdeckanji locale relies on default implementations for all optional methods, so Example 7-22 does not contain entries for any of the optional methods.

Example 7-22:  The methods File for the ja_JP.sdeckanji Locale

# sdeckanji.m   [1]
# <method_keyword> "<entry>" "<package>" "<library_path>"   [1]
 
METHODS    [2]
 
__mbstopcs "__mbstopcs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
__mbtopc   "__mbtopc_sdeckanji"   "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
__pcstombs "__pcstombs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
__pctomb   "__pctomb_sdeckanji"   "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
mblen      "__mblen_sdeckanji"    "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
mbstowcs   "__mbstowcs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
mbtowc     "__mbtowc_sdeckanji"   "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
wcstombs   "__wcstombs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
wcswidth   "__wcswidth_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
wctomb     "__wctomb_sdeckanji"   "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
wcwidth    "__wcwidth_sdeckanji"  "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  [3]
 
END METHODS    [4]

  1. Comment lines

    These lines specify the name of the methods file and the format of method entries. Note that the field identified in the format as <package> is ignored, but you must specify some string for this field in order to specify a library path.

  2. Header to mark start of method entries

  3. Entries for required methods

  4. Trailer to mark end of method entries

Refer to localedef(1) for detailed information about methods file entries.
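
Because every required entry follows the same pattern, a methods file can be generated rather than typed by hand. The following is a minimal sketch in POSIX sh syntax, not a tool supplied with the system. The gen_methods function name is hypothetical, and the sketch assumes the naming convention seen in Example 7-22: each function is named with a double-underscore prefix, the method name, and a codeset suffix. It writes each entry on one line instead of using backslash continuation.

```shell
#!/bin/sh
# gen_methods is a hypothetical helper, not a system command.
# $1 = codeset suffix used in function and library names (for example, sdeckanji)
# $2 = directory that holds the methods shared library
gen_methods() {
    suffix=$1
    libdir=$2
    lib="lib${suffix}.so"

    echo "METHODS"
    # Required method keywords, in the order used in Example 7-22.
    for keyword in __mbstopcs __mbtopc __pcstombs __pctomb \
                   mblen mbstowcs mbtowc wcstombs wcswidth wctomb wcwidth
    do
        # Drop a leading double underscore, if any, to get the base name.
        base=${keyword#__}
        # Fields: keyword, function name, package (ignored), library path.
        printf '%s "__%s_%s" "%s" "%s/%s"\n' \
            "$keyword" "$base" "$suffix" "$lib" "$libdir" "$lib"
    done
    echo "END METHODS"
}

# Print the entries for the ja_JP.sdeckanji locale to standard output.
gen_methods sdeckanji /usr/shlib
```

Redirect the output to a file such as sdeckanji.m, and compare the result with localedef(1) before relying on it.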

7.4    Building and Testing the Locale

Use the localedef command to build a locale from its source files. Example 7-23 shows the command line needed to build the French locale used in most examples in this chapter. Assume for this example that all source files reside in the user's default directory and that the resulting locale is also created in that directory.

Example 7-23:  Building the fr_FR.ISO8859-1@example Locale

% localedef -f ISO8859-1.cmap \    [1]
-i fr_FR.ISO8859-1.src \   [2]
fr_FR.ISO8859-1@example   [3]

  1. The -f option specifies the character map source file.

  2. The -i option specifies the locale definition source file.

  3. The final argument to the command is the name of the locale.
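
If you rebuild a locale often, you might wrap the command line from Example 7-23 in a small function that checks the source files first. The following sketch uses POSIX sh syntax rather than the C shell used in the examples. The build_locale function and the LOCALEDEF variable are hypothetical names introduced here for illustration; only the -f and -i options come from the example.

```shell
#!/bin/sh
# build_locale is a hypothetical wrapper around the localedef command.
# $1 = character map source file (passed with -f)
# $2 = locale definition source file (passed with -i)
# $3 = name of the locale to create
build_locale() {
    charmap=$1
    src=$2
    name=$3

    # Stop early if either source file cannot be read.
    for f in "$charmap" "$src"; do
        if [ ! -r "$f" ]; then
            echo "build_locale: cannot read $f" >&2
            return 1
        fi
    done

    # LOCALEDEF is an assumption of this sketch; set it to echo for a dry run.
    "${LOCALEDEF:-localedef}" -f "$charmap" -i "$src" "$name"
}
```

For example, build_locale ISO8859-1.cmap fr_FR.ISO8859-1.src fr_FR.ISO8859-1@example runs the same command as Example 7-23 when both source files are in the current directory.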

When you are testing locales, particularly ones that are similar to standard locales installed on the system, you should add an extension to the locale name. Varying names with the at (@) extension allows you to specify the standard strings for language, territory, and codeset and still be sure that the test locale is uniquely identified. This is important if you later decide to move the locale to the /usr/lib/nls/loc directory where other locales reside.

Example 7-23 shows only one form and a few options for the localedef command. The localedef(1) reference page provides a complete description of the command. The following is a summary of some important rules and options:

By default, locales must reside in the /usr/lib/nls/loc directory to be found. If you want to test your locale before moving it to the /usr/lib/nls/loc directory, you can define the LOCPATH variable to specify the directory where your locale is located. You can then define the LANG environment variable to be your new locale and interactively test the locale with commands and applications.

Example 7-24 uses the date command to test the date/time format.

Example 7-24:  Setting the LOCPATH Variable and Testing a Locale

% setenv LOCPATH ~harry/locales
% setenv LANG fr_FR.ISO8859-1@example
% date
ven 23 avr 13:43:05 EDT 1999

Note

The LOCPATH variable is an extension to specifications in the X/Open UNIX standard and therefore may not be recognized on all systems that conform to this standard.
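
To try a test locale without changing the environment of your whole login session, you can set the variables for a single command only. The following sketch uses POSIX sh syntax rather than the C shell setenv commands of Example 7-24. The with_locale function name is hypothetical, and the sketch assumes that LC_ALL should be cleared so that it does not override LANG.

```shell
#!/bin/sh
# with_locale is a hypothetical helper, not a system command.
# $1 = directory that contains the test locale (value for LOCPATH)
# $2 = locale name (value for LANG)
# remaining arguments = command to run under that locale
with_locale() {
    locdir=$1
    locname=$2
    shift 2
    (
        # The subshell keeps these settings away from the calling shell.
        unset LC_ALL
        LOCPATH=$locdir LANG=$locname
        export LOCPATH LANG
        "$@"
    )
}

# Repeat the test from Example 7-24 for one command only.
with_locale "$HOME/locales" fr_FR.ISO8859-1@example date
```

If the locale is not found in the given directory, most commands fall back to default behavior instead of failing, so check that the output is actually localized.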

Some programs have support files that are installed in system directories with names that exactly match the names of standard locales. In such cases, application software, system software, or both might use the value of the LANG environment variable to determine the locale-specific directory in which the support files reside. If assigned directly to the LANG or LC_ALL environment variable, locale file names with an at (@) suffix may result in invalid search paths for some applications. The following example shows how you can work around this problem by assigning the standard locale name to the LANG variable and the name of your variant locale to the locale category variables. You need to make assignments only to those category variables that represent areas where your locale differs from the locale on which it is based.


% setenv LANG fr_FR.ISO8859-1
% setenv LC_CTYPE fr_FR.ISO8859-1@example
% setenv LC_COLLATE fr_FR.ISO8859-1@example

.
.
.
% setenv LC_TIME fr_FR.ISO8859-1@example
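
The same workaround can be sketched as a function that derives the standard locale name by removing the at (@) suffix. The sketch uses POSIX sh syntax rather than C shell setenv commands. The use_variant function name is hypothetical, and the list of accepted category variables is an assumption that you should adjust to the categories your system supports.

```shell
#!/bin/sh
# use_variant is a hypothetical helper, not a system command.
# $1 = variant locale name, for example fr_FR.ISO8859-1@example
# remaining arguments = category variables that should use the variant
use_variant() {
    variant=$1
    shift

    # Standard locale name: everything before the @ suffix.
    LANG=${variant%@*}
    export LANG

    for category in "$@"; do
        case $category in
            LC_CTYPE|LC_COLLATE|LC_MONETARY|LC_NUMERIC|LC_TIME|LC_MESSAGES)
                # Assign the variant locale to this category only.
                export "$category=$variant"
                ;;
            *)
                echo "use_variant: unknown category $category" >&2
                return 1
                ;;
        esac
    done
}
```

For example, use_variant fr_FR.ISO8859-1@example LC_CTYPE LC_COLLATE LC_TIME sets LANG to fr_FR.ISO8859-1 and assigns the variant name to the three listed category variables.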