This chapter explains how to develop a locale, which
provides information appropriate for a particular combination of language,
territory, and codeset.
You use the
localedef
command to
create locales from the following files:
A character map source file (charmap)
The charmap(4) reference page explains the format and rules for this file.
This chapter includes a
charmap
example
that conforms to binary character encodings specified for the ISO Latin-1
codeset, which defines all characters as single 8-bit bytes.
The chapter
also includes an example that shows part of a
charmap
file
for the SJIS codeset, which defines both single-byte and multibyte characters.
A locale source file
The locale(4) reference page explains the rules and format for this file.
This chapter develops an example locale named fr_FR.ISO8859-1@example, which supports the language and customs of France.
A methods file with associated shareable library
These files are required when the
charmap
file defines
multibyte characters; otherwise, the files are optional.
The methods file
specifies the shareable library that contains redefinitions of the C Library
interfaces that convert data to and from internal process (wide-character)
encoding.
7.1 Creating a Character Map Source File for a Locale
A
charmap
file
defines symbols for character binary encodings.
The
localedef
command uses this file to map character symbols in a locale source file to
the character encodings.
Example 7-1
shows a fragment of the
ISO8859-1.cmap
source file that is used in the
fr_FR.ISO8859-1@example
locale being developed in this chapter.
Section E.1
contains the
ISO8859-1.cmap
file in its entirety.
Example 7-1: The charmap File for a Sample Locale
#  [1]
# Charmap for ISO 8859-1 codeset  [1]
#  [1]
<code_set_name> "ISO8859-1"  [2]
<mb_cur_max> 1  [2]
<mb_cur_min> 1  [2]
<escape_char> \  [2]
<comment_char> #  [2]
CHARMAP  [3]
# Portable characters and other standard  [1]
# control characters  [1]
<NUL> \x00  [4]
<SOH> \x01
<STX> \x02
<ETX> \x03
<EOT> \x04
<ENQ> \x05
<ACK> \x06
<BEL> \x07
<alert> \x07
<backspace> \x08
<tab> \x09
<newline> \x0a
<vertical-tab> \x0b
<form-feed> \x0c
<carriage-return> \x0d
<SO> \x0e
.
.
.
<zero> \x30  [4]
<one> \x31
<two> \x32
<three> \x33
<A> \x41
<B> \x42
<C> \x43
<D> \x44
.
.
.
<underscore> \x5f  [4]
<low-line> \x5f
<grave-accent> \x60
<a> \x61
<b> \x62
<c> \x63
<d> \x64
.
.
.
# Extended control characters  [1]
# (names taken from ISO 6429)  [1]
<PAD> \x80  [4]
<HOP> \x81
<BPH> \x82
<NBH> \x83
<IND> \x84
.
.
.
# Other graphic characters  [1]
<nobreakspace> \xa0  [4]
<inverted-exclamation-mark> \xa1
.
.
.
END CHARMAP  [5]
Comment line
By default, the comment character is the number sign (#).
You
can override this default with a
<comment_char>
definition
(see 2).
Keyword declarations
This example provides entries for all valid declarations and specifies
default values for all but
<code_set_name>
.
Usually,
you specify a declaration only when you want to override
its default value.
In this example, the declarations
for
<escape_char>
and
<comment_char>
specify the default values for the escape character and comment character,
respectively.
The value for
<mb_cur_max>
, the maximum
length (in bytes) of a character, is 1 for this particular
charmap
file.
The value for
<mb_cur_min>
, the minimum
length (in bytes) of a character, must be 1 in
charmap
files for all locales.
(All locales include characters in the Portable Character
Set, which defines single-byte characters.)
The <code_set_name> value is the value returned by the nl_langinfo(CODESET) call made by applications that bind to the locale at run time.
Header marking start of character maps
Symbol-to-coding maps for characters
Each character map consists of a symbolic name and encoding. The name and encoding are separated by one or more spaces.
A symbolic name begins with the left angle bracket (<) and ends with
the right angle bracket (>).
The characters between the angle brackets can
be any characters from the Portable Character Set, except for control and
space characters.
If the name includes more than one right angle bracket (>), all but the last one must be preceded by the value of <escape_char>.
A symbolic name cannot exceed 128 bytes in length.
An encoding can be one or more decimal, octal, or hexadecimal constants. (Multiple constants apply to multibyte encodings.) The constants have the following formats:
decimal
\dnnn or \dnn, where n is a decimal digit
hexadecimal
\xnn, where n is a hexadecimal digit
octal
\nnn or \nn, where n is an octal digit
You can define multiple character map entries (each with a different symbolic name) for the same encoding value. The example does this for several characters; for instance, <BEL> and <alert> both map to \x07, and <underscore> and <low-line> both map to \x5f.
Trailer marking end of character maps
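The constant formats described for character encodings can be illustrated with a small parser. The following Python sketch is not part of localedef; the function name parse_encoding is hypothetical, and the code assumes the default escape character (backslash). It converts the encoding field of a character map entry into the bytes it denotes.

```python
import re

# One constant: \dNNN (decimal), \xNN (hexadecimal), or \NNN (octal).
# Assumption: the escape character is the default backslash.
_CONSTANT = re.compile(r"\\(?:d(\d{2,3})|x([0-9A-Fa-f]{2})|([0-7]{2,3}))")


def parse_encoding(text):
    """Convert a charmap encoding such as r'\x81\x40' into bytes."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        match = _CONSTANT.match(text, pos)
        if match is None:
            raise ValueError(f"bad constant at offset {pos}: {text!r}")
        dec, hexa, octal = match.groups()
        if dec is not None:
            value = int(dec, 10)
        elif hexa is not None:
            value = int(hexa, 16)
        else:
            value = int(octal, 8)
        if value > 0xFF:
            raise ValueError(f"constant does not fit in one byte: {text!r}")
        out.append(value)
        pos = match.end()
    return bytes(out)
```

Each constant contributes one byte, so a multibyte encoding is simply several constants written one after another.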
The source files for codesets with multibyte characters have more complex
character maps.
Example 7-2
shows a subset of character
map entries from a source file for the Japanese SJIS codeset.
This source
file specifies entries from several character sets that must be supported
within the same codeset.
Example 7-2: Fragment from a charmap File for a Multibyte Codeset
# SJIS charmap
#
<code_set_name> "SJIS"  [1]
<mb_cur_min> 1  [2]
<mb_cur_max> 2  [3]
CHARMAP
#
# CS0: ASCII
#
.
.
.
<commercial-at> \x40  [4]
<A> \x41  [4]
<B> \x42  [4]
.
.
.
#
# CS1: JIS X0208-1983 for ShiftJIS.
#
<zenkaku-space> \x81\x40  [5]
<j0101>...<j0163> \x81\x40  [5]
<j0164>...<j0194> \x81\x80  [5]
.
.
.
#
# UDC Area in JIS X0208 plane
#
<u8501>...<u8563> \xeb\x40  [6]
<u8564>...<u8594> \xeb\x80  [6]
<u8601>...<u8663> \xeb\x9f  [6]
.
.
.
#
# CS2: JIS X0201 (so-called Hankaku-Kana)
#
<kana-fullstop> \xa1  [7]
.
.
.
<kana-conjunctive> \xa5  [7]
<kana-WO> \xa6  [7]
<kana-a> \xa7  [7]
.
.
.
END CHARMAP
Codeset name
Minimum number of bytes per character
This value must be 1.
Maximum number of bytes per character
In SJIS, the largest multibyte character is 2 bytes in length.
Symbols and encodings for ASCII characters
Symbols and encodings for SJIS characters
Note how character symbols are specified as a range and how two hexadecimal values determine the encoding for a 2-byte character.
When symbols are specified as a range of symbol values, the specified
character encoding applies to the first symbol in the range.
The
localedef
command automatically increments both the symbol value
and the encoding value to create symbols and encodings for all characters
in the range.
Maps for user-defined characters within the SJIS codeset
These maps establish ranges of encodings for which users can later define characters.
Maps for the single-byte characters of the Hankaku-Kana character set
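The range notation used for the SJIS entries can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the localedef implementation: the function name expand_range is hypothetical, the numeric suffix of the symbol is treated as a decimal counter, and the encoding is incremented as a single big-endian integer. See charmap(4) for the actual rules.

```python
import re

# A symbol such as <j0101>: a non-numeric prefix followed by digits.
_SYMBOL = re.compile(r"<([^<>0-9]*)(\d+)>")


def expand_range(first, last, encoding):
    """Yield (symbol, encoding) pairs for a range such as <j0101>...<j0163>."""
    m1, m2 = _SYMBOL.fullmatch(first), _SYMBOL.fullmatch(last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError("range endpoints must share a prefix and end in digits")
    prefix, width = m1.group(1), len(m1.group(2))
    start, stop = int(m1.group(2)), int(m2.group(2))
    code = int.from_bytes(encoding, "big")
    for offset in range(stop - start + 1):
        # Increment the symbol value and the encoding value together.
        symbol = f"<{prefix}{start + offset:0{width}d}>"
        yield symbol, (code + offset).to_bytes(len(encoding), "big")


entries = list(expand_range("<j0101>", "<j0163>", b"\x81\x40"))
```

Under these assumptions the range covers 63 symbols, and the last one, <j0163>, receives the encoding \x81\x7e.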
Refer to charmap(4) for a complete list of rules that apply to character map source files.
Note
The symbolic names for characters in character map source files are in the process of becoming standardized. A future revision of the X/Open UNIX standard will likely specify both long and short symbolic names for characters.
The symbolic names for characters shown in this example are not necessarily the names being proposed for adoption by any standards group.
7.2 Creating Locale Definition Source Files
A locale definition source file defines data that is specific
to a particular language and territory.
The source file is organized into
sections, one for each category of locale data being defined.
Example 7-3
shows the structure of a locale definition source file in pseudocode.
The
sections for locale categories are discussed in more detail following the
example.
Example 7-3: Structure of Locale Source Definition File
# comment-line  [1]
comment_char <char_symbol1>  [2]
escape_char <char_symbol2>  [3]
CATEGORY_NAME  [4]
category_definition-statement  [5]
category_definition-statement  [5]
.
.
.
END CATEGORY_NAME [6]
.
.
.
[7]
Comment line
The number sign (#) is the default comment character.
You can specify comments as entire lines by entering the comment character in the first column of the line.
You cannot specify comments on the same lines as definition statements in locale source files.
In this respect, locale source files differ from character map source files.
Redefinition of comment character
You can override the default comment character with an entry line that
begins with the
comment_char
keyword, followed by the symbol
for the desired character.
The character symbol is defined in the character map (charmap) source file for the locale.
Redefinition of escape character
The escape character, by default the backslash (\),
is used in decimal, hexadecimal, and octal constants and to indicate when
definition statements are continued to the next line of the source file.
You can override the default escape character with an entry line that begins
with the
escape_char
keyword, followed by one or more blank
characters, then the symbol for the desired character.
The character symbol
is defined in the character map source file for the locale.
Header for locale category section
Section headers correspond to category names, which are LC_CTYPE, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_MESSAGES, and LC_TIME.
Definition statement for the category
The format of these statements varies from one category to the next. In general, a statement begins with a keyword, followed by one or more spaces or tabs, then the definition itself.
In place of
any category definition statements, you can include a
copy
statement to include definition statements in another locale source file.
For example:
copy en_US.ISO8859-1
If you include a copy statement, you can include no other statements in the category.
Trailer for locale category section
Section trailers start with the END keyword, followed by the category name.
You can include sections for all locale categories or only a subset of categories. If you omit a section for a locale category from the source file, the definition for the omitted category is the same as defined for the POSIX, or C, locale.
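The header, statement, and trailer structure described above can be made concrete with a short Python sketch that splits a locale source text into its category sections. The function name split_categories and the sample text are hypothetical; the sketch ignores comment_char and escape_char redefinitions and does not join continued lines.

```python
def split_categories(source, comment_char="#"):
    """Return {category_name: [statement, ...]} for a locale source text."""
    categories = {}
    current = None
    for raw in source.splitlines():
        # Comments occupy whole lines and start in the first column.
        if raw.startswith(comment_char):
            continue
        line = raw.strip()
        if not line:
            continue
        if current is None:
            if line.startswith("LC_"):
                current = line              # section header
                categories[current] = []
            # Other top-level lines (comment_char, escape_char) are skipped.
        elif line == f"END {current}":
            current = None                  # section trailer
        else:
            categories[current].append(line)
    if current is not None:
        raise ValueError(f"missing END {current}")
    return categories


sample = """\
# hypothetical locale source
LC_MESSAGES
copy en_US.ISO8859-1
END LC_MESSAGES
LC_MONETARY
mon_decimal_point "<comma>"
mon_thousands_sep ""
END LC_MONETARY
"""
sections = split_categories(sample)
```

Categories that do not appear in the result, such as LC_TIME in this sample, are the ones that would take their definitions from the POSIX (C) locale.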
The following sections describe specific locale categories and include
parts of the
fr_FR.ISO8859-1@example.src
locale source
file.
Section E.2
contains this source file in its entirety.
7.2.1 Defining the LC_CTYPE Locale Category
The
LC_CTYPE
section of a locale
source file defines character classes and character attributes used in operations
such as case conversion.
Example 7-4
shows the definition
for this section.
Example 7-4: LC_CTYPE Category Definition
#############
LC_CTYPE  [1]
#############
upper   <A>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;\
        <N>;<O>;<P>;<Q>;<R>;<S>;<T>;<U>;<V>;<W>;<X>;<Y>;<Z>;\
        <A-grave>;\
.
.
.
        <U-diaeresis>  [2]
lower   <a>;<b>;<c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;\
        <n>;<o>;<p>;<q>;<r>;<s>;<t>;<u>;<v>;<w>;<x>;<y>;<z>;\
        <a-grave>;\
.
.
.
        <u-diaeresis>  [2]
space   <tab>;<newline>;<vertical-tab>;<form-feed>;\
        <carriage-return>;<space>  [2]
cntrl   <NUL>;<SOH>;<STX>;<ETX>;<EOT>;<ENQ>;<ACK>;\
        <alert>;<backspace>;<tab>;<newline>;<vertical-tab>;\
        <form-feed>;<carriage-return>;\
.
.
.
        <SOS>;<SGCI>;<SCI>;<CSI>;<ST>;<OSC>;<PM>;<APC>  [2]
graph   <exclamation-mark>;<quotation-mark>;<number-sign>;\
.
.
.
        <u-circumflex>;<u-diaeresis>;<y-acute>;<thorn-icelandic>;<y-diaeresis>  [2]
# print class includes everything in the graph class above, plus <space>.
print   <exclamation-mark>;<quotation-mark>;<number-sign>;\
.
.
.
        <u-circumflex>;<u-diaeresis>;<y-acute>;<thorn-icelandic>;<y-diaeresis>;\
        <space>  [2]
punct   <exclamation-mark>;<quotation-mark>;<number-sign>;\
        <dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
        <left-parenthesis>;<right-parenthesis>;<asterisk>;\
        <plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
        <colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
        <greater-than-sign>;<question-mark>;<commercial-at>;\
        <left-square-bracket>;<backslash>;<right-square-bracket>;\
        <circumflex>;<underscore>;<grave-accent>;<left-brace>;\
        <vertical-line>;<right-brace>;<tilde>  [2]
digit   <zero>;<one>;<two>;<three>;<four>;\
        <five>;<six>;<seven>;<eight>;<nine>  [2]
xdigit  <zero>;<one>;<two>;<three>;<four>;\
        <five>;<six>;<seven>;<eight>;<nine>;\
        <A>;<B>;<C>;<D>;<E>;<F>;\
        <a>;<b>;<c>;<d>;<e>;<f>  [2]
blank   <space>;<tab>  [2]
toupper (<a>,<A>);(<b>,<B>);(<c>,<C>);(<d>,<D>);(<e>,<E>);\
        (<f>,<F>);(<g>,<G>);(<h>,<H>);(<i>,<I>);(<j>,<J>);\
        (<k>,<K>);(<l>,<L>);(<m>,<M>);(<n>,<N>);(<o>,<O>);\
        (<p>,<P>);(<q>,<Q>);(<r>,<R>);(<s>,<S>);(<t>,<T>);\
        (<u>,<U>);(<v>,<V>);(<w>,<W>);(<x>,<X>);(<y>,<Y>);\
        (<z>,<Z>);\
        (<a-grave>,<A-grave>);\
        (<a-circumflex>,<A-circumflex>);\
        (<ae-ligature>,<AE-ligature>);\
        (<c-cedilla>,<C-cedilla>);\
        (<e-grave>,<E-grave>);\
        (<e-acute>,<E-acute>);\
        (<e-circumflex>,<E-circumflex>);\
        (<e-diaeresis>,<E-diaeresis>);\
        (<i-circumflex>,<I-circumflex>);\
        (<i-diaeresis>,<I-diaeresis>);\
        (<o-circumflex>,<O-circumflex>);\
        (<u-grave>,<U-grave>);\
        (<u-circumflex>,<U-circumflex>);\
        (<u-diaeresis>,<U-diaeresis>)  [3]
# tolower class is the inverse of toupper.
tolower (<A>,<a>);(<B>,<b>);(<C>,<c>);(<D>,<d>);(<E>,<e>);\
        (<F>,<f>);(<G>,<g>);(<H>,<h>);(<I>,<i>);(<J>,<j>);\
        (<K>,<k>);(<L>,<l>);(<M>,<m>);(<N>,<n>);(<O>,<o>);\
        (<P>,<p>);(<Q>,<q>);(<R>,<r>);(<S>,<s>);(<T>,<t>);\
        (<U>,<u>);(<V>,<v>);(<W>,<w>);(<X>,<x>);(<Y>,<y>);\
        (<Z>,<z>);\
        (<A-grave>,<a-grave>);\
        (<A-circumflex>,<a-circumflex>);\
        (<AE-ligature>,<ae-ligature>);\
        (<C-cedilla>,<c-cedilla>);\
        (<E-grave>,<e-grave>);\
        (<E-acute>,<e-acute>);\
        (<E-circumflex>,<e-circumflex>);\
        (<E-diaeresis>,<e-diaeresis>);\
        (<I-circumflex>,<i-circumflex>);\
        (<I-diaeresis>,<i-diaeresis>);\
        (<O-circumflex>,<o-circumflex>);\
        (<U-grave>,<u-grave>);\
        (<U-circumflex>,<u-circumflex>);\
        (<U-diaeresis>,<u-diaeresis>)  [3]
END LC_CTYPE  [4]
Section header
Definition of character class
These definitions start with a keyword that stands for the character class (also referred to as a property), followed by one or more blank characters, then a list of symbols for all characters in that class. You can substitute the character's encoding for its symbol; however, specifying characters by their encodings diminishes the readability of the locale source file and makes it impossible to use the file with more than one codeset.
Although not illustrated in the example, you can specify a horizontal ellipsis (...) to represent a range of characters.
In the
string
<NUL>;...;<tab>
, for example, the ellipsis
represents all characters whose encodings are between the character whose
symbol is
<NUL>
and the character whose symbol is
<tab>
.
The symbols and their encodings are specified in the
charmap
file for the locale.
Character classes as defined by the X/Open UNIX standard are represented by the following keywords:
upper
(uppercase letter characters)
lower
(lowercase letter characters)
alpha
(all letter characters)
By default, this class is the combination of characters specified for
the
upper
and
lower
classes.
The
alpha
class is not explicitly defined in the sample locale, so the
default definition applies.
space
(white-space characters)
cntrl
(control characters)
punct
(punctuation characters)
digit
(numeric digits)
xdigit
(hexadecimal digits)
blank
(blank characters)
graph
By default, this class is the combination of characters in the
alpha
,
digit
, and
punct
classes.
print
By default, this class is the combination of characters in the
alpha
,
digit
, and
punct
classes,
plus the space character.
From the application standpoint, there is also the class
alnum
.
This class is rarely defined in a locale because it is always
a combination of characters in the
alpha
and
digit
classes.
Unicode (*.UTF-8) locales include character classes as defined by the Unicode standard.
See locale(4) for details about character classification for Unicode.
Certain locales, such as those for Asian languages like Japanese, may define nonstandard character classes.
Definitions of case conversion for letter characters
These definitions, which begin with the keywords
toupper
and
tolower
, list symbols in pairs rather than individually.
In the
toupper
definition shown here, the first symbol
in the pair is the symbol for a lowercase letter and the second symbol is
the symbol for that letter's uppercase equivalent.
This definition determines what a letter is converted to when functions such as towupper() and towlower() perform case conversion on text data.
Locales that define nonstandard character classes may define other property
conversion definitions that are used by the
wctrans()
and
towctrans()
functions.
Section trailer
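The default class relationships and the toupper/tolower inversion described above can be expressed with ordinary set operations. The following Python sketch uses a deliberately tiny, hypothetical subset of the symbols in the example; it models the relationships only and is not how localedef stores classes.

```python
# Tiny hypothetical subsets of the classes in the example.
upper = {"<A>", "<B>", "<A-grave>"}
lower = {"<a>", "<b>", "<a-grave>"}
digit = {"<zero>", "<one>"}
punct = {"<comma>", "<period>"}

# Defaults described in the text.
alpha = upper | lower                   # all letter characters
alnum = alpha | digit                   # visible to applications only
graph = alpha | digit | punct
print_class = graph | {"<space>"}       # graph plus the space character

# toupper lists (lowercase, uppercase) pairs; tolower is its inverse.
toupper = [("<a>", "<A>"), ("<b>", "<B>"), ("<a-grave>", "<A-grave>")]
tolower = [(up, low) for low, up in toupper]
```

Deriving tolower from toupper in this way mirrors the comment in the example that the tolower class is the inverse of toupper.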
The preceding example does not completely illustrate all the options
you can use when defining the
LC_CTYPE
category.
You can:
Use a
copy
statement to include the entire
category definition from another locale
When you use a copy statement, it must be the only entry between the section header and trailer.
Omit any of the standard character classes or define different character classes
The standard character classes are language specific. Therefore, the standard character classes may not apply to all languages. Define for a locale only the standard character classes that are appropriate for the locale's language. Depending on the language, it may be necessary to define nonstandardized classes.
A definition for a nonstandardized character class must be preceded
by the
charclass
statement to define a keyword for the
class, followed by the class definition.
For example:
charclass vowel
vowel <a>;<e>;<i>;<o>;<u>;<y>
Applications can use the
wctype( )
and
iswctype( )
functions to determine and test all character
classes (including user-defined ones).
Applications can use class-specific functions, such as iswalpha() and iswpunct(), to test the standard character classes.
Note
The LC_CTYPE category of the fr_FR.ISO8859-1@example locale is limited to letter characters in the French language. Some locale developers would define character classes to include characters in all the languages supported by the ISO 8859-1 character set. This practice allows locales for multiple Western European languages to use the same LC_CTYPE source definitions through a copy statement.
Refer to locale(4) for additional rules and restrictions that apply to the LC_CTYPE category definition.
7.2.2 Defining the LC_COLLATE Locale Category
The
LC_COLLATE
section of a locale source file specifies how characters and strings are collated.
Example 7-5
shows part of an
LC_COLLATE
section.
Example 7-5: LC_COLLATE Category Definition
LC_COLLATE  [1]
order_start forward;backward;forward  [2]
<NUL>  [3]
<SOH>
<STX>
<ETX>
<EOT>
<ENQ>
<ACK>
<alert>
<backspace>
<tab>
.
.
.
<APC>  [3]
<space> <space>;<space>;<space>
<exclamation-mark> <exclamation-mark>;<exclamation-mark>;<exclamation-mark>
<quotation-mark> <quotation-mark>;<quotation-mark>;<quotation-mark>
.
.
.
<a> <a>;<a>;<a>  [3]
<A> <a>;<a>;<A>
<feminine> <a>;<feminine>;<feminine>
<a-acute> <a>;<a-acute>;<a-acute>
<A-acute> <a>;<a-acute>;<A-acute>
<a-grave> <a>;<a-grave>;<a-grave>
<A-grave> <a>;<a-grave>;<A-grave>
<a-circumflex> <a>;<a-circumflex>;<a-circumflex>
<A-circumflex> <a>;<a-circumflex>;<A-circumflex>
<a-ring> <a>;<a-ring>;<a-ring>
<A-ring> <a>;<a-ring>;<A-ring>
<a-diaeresis> <a>;<a-diaeresis>;<a-diaeresis>
<A-diaeresis> <a>;<a-diaeresis>;<A-diaeresis>
<a-tilde> <a>;<a-tilde>;<a-tilde>
<A-tilde> <a>;<a-tilde>;<A-tilde>
<ae-ligature> <a>;<a><e>;<a><e>
<AE-ligature> <a>;<a><e>;<A><E>
<b> <b>;<b>;<b>
<B> <b>;<b>;<B>
<c> <c>;<c>;<c>
<C> <c>;<c>;<C>
<c-cedilla> <c>;<c-cedilla>;<c-cedilla>
<C-cedilla> <c>;<c-cedilla>;<C-cedilla>
.
.
.
<z> <z>;<z>;<z>  [3]
<Z> <z>;<z>;<Z>
UNDEFINED  [4]
order_end  [5]
END LC_COLLATE  [6]
Section header
An
order_start
keyword
that marks the beginning of a section with statements that assign collating
weights to elements
Following the
order_start
keyword on the same line
are sort directives, separated by semicolons (;), that apply to each sorting
pass.
Sort directives can include the following keywords:
forward
, which specifies that the comparison
operation proceeds from the start of the string towards the end of the string
backward
, which specifies that the comparison
operation proceeds from the end of the string towards the start of the string
position, which specifies that the comparison operation considers the relative position of characters in the string that are not subject to the collating weight IGNORE. In other words, position ensures that nonignored characters that are the shortest distance from the start (forward,position) or end (backward,position) of the string collate first.
When a sort directive includes two keywords, the
position
keyword combined with either
forward
or
backward
, the two keywords are separated by a comma (,).
The
position
keyword by itself is equivalent to the directive
forward,position
.
The number of sort directives corresponds to the number of weights each collating element is assigned in subsequent statements.
Each sort directive and its associated set of weights specify information
for one pass, or level, of string comparison.
The first directive applies
when the string comparison operation applies the primary weight, the second
when the string comparison operation applies the secondary weight, and so
on.
The number of levels required to collate strings correctly depends on
language and cultural requirements and therefore varies from one locale to
another.
There is also a level number maximum, associated
with the
COLL_WEIGHTS_MAX
setting in the
limits.h
and
sys/localedef.h
files.
On Tru64 UNIX
systems, you are limited to six collation levels (sort directives).
The
backward
directive is used for many languages
to ensure that accented characters sort after unaccented characters only if
the compared strings are otherwise equivalent.
The
position
directive is frequently used to handle
characters, such as the hyphen (-) in Western European languages, whose significance
can be relative to word position.
For example, assume you want the word "o-ring" to collate in a word list before the word "or-ing", but do not want the hyphen to be considered until after strings are sorted by letters alone.
You would need two sort directives and associated sets of weight specifiers
to implement this order.
For the first comparison operation, you specify
forward
as the sort directive, letters as the first weights for
all letter characters, and
IGNORE
as the weight for the
hyphen character.
For the second, or a later, comparison operation, you specify
forward,position
as the sort directive,
IGNORE
as the weight for all letter characters, and the hyphen as the weight for
the hyphen character.
If you do not specify a sort directive, the default is forward.
Collation order statements for elements
These statements specify a character symbol, optionally followed by one or more blank characters (spaces or tabs), then the symbols for characters that have the same weight at each stage of the sort.
In the example, the sort order is control characters, followed by punctuation and digits, and then letters. Letters are sorted on multiple passes, with diacritics and case ignored on the first pass, diacritics being significant on the second pass, and case being significant on the third pass.
Collation order statement for characters not specified in other collation order statements
The
UNDEFINED
keyword begins a collation order statement
to be applied to all characters that are defined in the locale's
charmap
file but not specified in other collation order statements.
Characters that fall into the
UNDEFINED
category are considered
in regular expressions to belong to the same equivalence class.
You should always include the
UNDEFINED
collation
order statement.
If this statement is absent, the
localedef
command includes undefined characters at the end of the collating order and
issues a warning.
Furthermore, if you place an
UNDEFINED
statement as the last collation order statement, the
localedef
command can sometimes compress all undefined characters into one entry.
This
action can reduce the size of the locale.
This locale specifies that any characters specified in the locale's
charmap
file but not handled by other collation order statements
be ordered last.
An
UNDEFINED
statement can have an operand.
For example,
the
IGNORE
keyword causes any characters unspecified by
other collation order statements to be ignored for the sort pass in which
IGNORE
appears.
If the following
UNDEFINED
statement
had been included in the example, characters not specified in other collation
order statements would be ignored in all sort passes defined by those statements:
UNDEFINED IGNORE;IGNORE;IGNORE
Trailer to indicate the end of collation order statements
Trailer to indicate the end of the LC_COLLATE section
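The effect of the forward, backward, and position directives can be seen in a simplified model of multilevel comparison. The Python sketch below is an illustration only: the function name sort_key, the numeric weights, and the weight tables are hypothetical, and real collation routines do not work on Python tuples. Each sort directive produces one level of the key, and keys are compared level by level.

```python
IGNORE = None  # stands in for the IGNORE weight keyword


def sort_key(text, weights, directives):
    """Build one key entry per sort directive (collation level)."""
    key = []
    for level, directive in enumerate(directives):
        parts = directive.split(",")
        # Keep (position, weight) for every character that is not IGNOREd.
        items = [(pos, weights[ch][level])
                 for pos, ch in enumerate(text)
                 if weights[ch][level] is not IGNORE]
        if "backward" in parts:
            # Compare from the end of the string toward the start.
            last = len(text) - 1
            items = [(last - pos, w) for pos, w in reversed(items)]
        if "position" in parts:
            key.append(items)                    # position is significant
        else:
            key.append([w for _, w in items])    # weights only
    return tuple(key)


# Hypothetical weights: (base letter, accent, case).
FRENCH = {
    "c": (1, 0, 0), "e": (2, 0, 0), "o": (3, 0, 0), "t": (4, 0, 0),
    "\u00e9": (2, 1, 0),   # e-acute
    "\u00f4": (3, 1, 0),   # o-circumflex
}
words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
french_order = sorted(
    words, key=lambda w: sort_key(w, FRENCH, ["forward", "backward", "forward"]))

# Letters carry a weight on the first pass only; the hyphen on the second.
HYPHEN = {ch: (rank, IGNORE) for rank, ch in enumerate("ginor", 1)}
HYPHEN["-"] = (IGNORE, 1)
hyphen_order = sorted(
    ["or-ing", "o-ring"],
    key=lambda w: sort_key(w, HYPHEN, ["forward", "forward,position"]))
```

With the backward directive on the second level, the four French words sort as cote, côte, coté, côté, because the accent nearest the end of the word is examined first. In the second list, the two words tie on letters alone, and the forward,position pass places "o-ring" first because its hyphen is closer to the start of the string.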
The preceding example shows only a few of the options that you can specify
when defining the
LC_COLLATE
category.
You can also use:
A
copy
statement to include the entire
category definition from another locale
A
copy
statement can be the only entry between the
section trailer and header.
Collating order statements that specify a string of characters, rather than single characters, as the collating elements
In such cases, you first specify
collating-element
statements before the
order_start
statement to define symbols
for the strings.
You can then specify those symbols in collating order statements.
For example:
collating-element <ch> from "<c><h>"
.
.
.
order_start forward;forward;backward
.
.
.
<ch> <Ch>;<ch>;<ch>
.
.
.
Symbolic names, such as
<UPPERCASE>
,
to use as weight specifiers in collation order statements
You must define each symbolic name by using the
collating-symbol
statement in the source file before the
order_start
statement.
You then include the symbol in the appropriate position in the
list of collation order statements for collating elements.
For example, if
you wanted the symbol
<LOW>
to represent the lowest
position in the collating order,
<LOW>
would be the
line entry immediately following the
order_start
statement.
A symbol such as
<UPPERCASE>
would be positioned on
the line immediately preceding the section of collating order statements for
uppercase letters.
A symbol must occur before the first collation order statement in which it is used. Therefore, you cannot define a symbol for the highest position in the collating order.
After symbols are defined and positioned, you can use them as weights in collating order statements. For example:
collating-symbol <LOWERCASE>
collating-symbol <UNACCENTED>
.
.
.
order_start forward;backward;forward;forward
.
.
.
<UNACCENTED>
.
.
.
<LOWERCASE>
<a> <a>;<UNACCENTED>;<LOWERCASE>;IGNORE
.
.
.
Refer to locale(4) for more detailed information on the LC_COLLATE category definition.
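As an illustration of what a collating-element statement implies for string comparison, the following Python sketch splits a string into collating elements, treating a declared character sequence such as "ch" as a single unit. The function name collating_elements and the longest-match strategy are assumptions made for this sketch, not a description of the library implementation.

```python
def collating_elements(text, multi):
    """Split text into collating elements, preferring the longest match."""
    ordered = sorted(multi, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for seq in ordered:
            if text.startswith(seq, i):
                out.append(seq)       # multicharacter collating element
                i += len(seq)
                break
        else:
            out.append(text[i])       # ordinary single character
            i += 1
    return out


elements = collating_elements("chico", ["ch"])
```

Once the string is split this way, each element, including the two-character <ch>, can be assigned its own set of weights in a collation order statement.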
7.2.3 Defining the LC_MESSAGES Locale Category
The
LC_MESSAGES
section of a locale source
file defines strings that are valid for affirmative and negative responses
from users.
Example 7-6
shows an
LC_MESSAGES
section.
Example 7-6: LC_MESSAGES Category Definition
LC_MESSAGES  [1]
# yes expression. The following designates:
# "^([oO]|[oO][uU][iI])"
yesexpr "<circumflex><left-parenthesis>\
<left-square-bracket><o><O><right-square-bracket>\
<vertical-line><left-square-bracket><o><O>\
<right-square-bracket><left-square-bracket><u><U>\
<right-square-bracket><left-square-bracket><i><I>\
<right-square-bracket><right-parenthesis>"  [2]
# no expression. The following designates:
# "^([nN]|[nN][oO][nN])"
noexpr "<circumflex><left-parenthesis>\
<left-square-bracket><n><N><right-square-bracket>\
<vertical-line><left-square-bracket><n><N>\
<right-square-bracket><left-square-bracket><o><O>\
<right-square-bracket><left-square-bracket><n><N>\
<right-square-bracket><right-parenthesis>"  [3]
# yes string. The following designates: "oui:o:O"
yesstr "<o><u><i><colon><o><colon><O>"  [4]
# no string. The following designates: "non:n:N"
nostr "<n><o><n><colon><n><colon><N>"  [5]
END LC_MESSAGES  [6]
Section header
Definition of an expression for a valid "yes" response
This entry
consists of the
yesexpr
keyword, followed by one or more
spaces or tabs, and an extended regular expression that is delimited by double
quotation marks.
This expression specifies that "oui" or "o"
(case is ignored) is a valid affirmative response in this locale.
Note that
the regular expression for
yesexpr
specifies individual
characters by their symbols as defined in the locale's charmap file.
Definition of an expression for a valid "no" response
This entry consists
of the
noexpr
keyword, followed by one or more spaces or
tabs, and an extended regular expression that is delimited by double quotation
marks.
This expression specifies that "non" or "n" (case is ignored) is a valid negative response in this locale.
Definition of a string for a valid "yes" response
This entry consists
of the
yesstr
keyword, followed by one or more spaces or tabs,
and a fixed string that is delimited by double quotation marks.
The
yesstr
entry is marked as LEGACY in the X/Open
UNIX standard and is not included in the POSIX standard; however, some applications
and systems software still might use
yesstr
rather than
yesexpr
.
To ensure that your locale works correctly with such software,
you should define
yesstr
in your locale.
Note that the
X/Open UNIX standard defines a single fixed string for
yesstr
.
The colon (:) separator, which allows multiple fixed strings to be specified,
is an extension to the standard definition.
Definition of a string for a valid "no" response
This entry consists
of the
nostr
keyword, followed by one or more spaces or tabs,
and a fixed string that is delimited by double quotation marks.
The
nostr
entry is marked as LEGACY in the X/Open
UNIX standard and is not included in the POSIX standard; however, some applications
and systems software still might use
nostr
rather than
noexpr
.
To ensure that your locale works correctly with such software,
you should define
nostr
in your locale.
Note that the X/Open
UNIX standard defines a single fixed string for
nostr
.
The colon (:) separator, which allows multiple fixed strings to be specified,
is an extension to the standard definition.
Section trailer
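To see how the symbolic yesexpr and yesstr entries relate to the strings an application ends up with, the following Python sketch expands the symbols through a small table and tests a few responses. The CHARMAP dictionary is a hypothetical excerpt, the helper names are invented for the sketch, and Python's re module stands in for an extended regular expression matcher; this particular pattern uses only syntax common to both.

```python
import re

# Hypothetical excerpt of a charmap: symbolic name -> character.
CHARMAP = {
    "circumflex": "^", "left-parenthesis": "(", "right-parenthesis": ")",
    "left-square-bracket": "[", "right-square-bracket": "]",
    "vertical-line": "|", "colon": ":",
    "o": "o", "O": "O", "u": "u", "U": "U", "i": "i", "I": "I",
}


def expand_symbols(text):
    """Replace each <symbol> with the character it names."""
    return re.sub(r"<([^<>]+)>", lambda m: CHARMAP[m.group(1)], text)


yesexpr = expand_symbols(
    "<circumflex><left-parenthesis>"
    "<left-square-bracket><o><O><right-square-bracket>"
    "<vertical-line><left-square-bracket><o><O>"
    "<right-square-bracket><left-square-bracket><u><U>"
    "<right-square-bracket><left-square-bracket><i><I>"
    "<right-square-bracket><right-parenthesis>"
)

# The colon-separated yesstr value yields several fixed strings.
yesstr = expand_symbols("<o><u><i><colon><o><colon><O>").split(":")


def is_yes(response):
    """Return True if the response matches the yes expression."""
    return re.match(yesexpr, response) is not None
```

The expanded expression is the one given in the comment in the example, and the fixed-string form splits into the alternatives oui, o, and O.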
As an alternative to specifying symbol definitions, you can use the
copy
statement between the section header and trailer to duplicate
an existing locale's definition of the
LC_MESSAGES
category.
The
copy
statement represents a complete definition of
the category and cannot be used along with explicit symbol definitions.
7.2.4 Defining the LC_MONETARY Locale Category
The
LC_MONETARY
section of the locale source
file defines the rules and symbols used to format monetary values.
Application
developers use the
localeconv( )
and
nl_langinfo( )
functions to determine the information defined in this section
and apply formatting rules through the
strfmon( )
function.
Example 7-7
shows an
LC_MONETARY
section.
Example 7-7: LC_MONETARY Category Definition
LC_MONETARY  [1]
int_curr_symbol    "<F><R><F><space>"  [2]
currency_symbol    "<F>"  [2]
mon_decimal_point  "<comma>"  [2]
mon_thousands_sep  ""  [2]
mon_grouping       3;0  [2]
positive_sign      ""  [2]
negative_sign      "<hyphen>"  [2]
.
.
.
END LC_MONETARY [3]
Section header
Symbol definitions
The entries in the example specify the following:
The international currency symbol is
FRF
(French Franc) and the local currency symbol is
F
(Franc).
The decimal point is the comma (,).
No character is defined to group digits to the left of the decimal point.
Digits to the left of the decimal point are grouped in threes. Because this locale does not define a default monetary thousands separator, the monetary grouping defined in this locale is significant only if the application uses a function to specify a thousands separator.
The positive sign is null.
The negative sign is the minus (-) character.
Section trailer [Return to example]
The following list describes the symbol names you can define in the
LC_MONETARY
section.
int_curr_symbol
The international currency symbol
currency_symbol
The local currency symbol
mon_decimal_point
The radix character, or decimal point, used in monetary formats
mon_thousands_sep
The character used to separate groups of digits to the left of the radix character
mon_grouping
The size of each group of digits to the left of the radix character.
The character defined by
mon_thousands_sep
, if any, is
inserted between the groups defined by
mon_grouping
.
You
can vary the size of groups by specifying multiple digits separated by a semicolon
(;).
For example,
3;2
specifies that the first group to
the left of the radix character contains three digits and all subsequent groups
contain two digits.
On Tru64 UNIX systems,
3;0
and
3
are equivalent; that is, all digits to the left of the decimal
point are grouped by three.
positive_sign
The string indicating that a monetary value is nonnegative
negative_sign
The string indicating that a monetary value is negative
int_frac_digits
The number of digits to be written to the right of the radix character
when
int_curr_symbol
appears in the format
frac_digits
The number of digits to be written to the right of the radix character
when
currency_symbol
appears in the format
p_cs_precedes
An integer that determines if the international or local currency symbol precedes a nonnegative value
p_sep_by_space
An integer that determines whether a space separates the international or local currency symbol from other parts of a formatted, nonnegative value
n_cs_precedes
An integer that determines if the international or local currency symbol precedes a negative value
n_sep_by_space
An integer that determines whether a space separates the international or local currency symbol from other parts of a formatted, negative value
p_sign_posn
An integer that indicates if or how the positive sign string is positioned in a nonnegative, formatted value
n_sign_posn
An integer that indicates how the negative sign string is positioned in a negative, formatted value
As an alternative to specifying symbol definitions, you can use the
copy
statement between the section header and trailer to duplicate
an existing locale's definition of
LC_MONETARY
.
The
copy
statement represents a complete definition of the category
and cannot be used along with explicit symbol definitions.
Refer to
locale
(4)
for complete information about specifying
LC_MONETARY
symbol definitions.
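The grouping rule described for mon_grouping and mon_thousands_sep can be sketched in C. The function group_digits is a hypothetical helper, not a C library interface. It assumes the group sizes are passed as an integer array in mon_grouping order, that the last size repeats, and that a 0 entry means "keep using the previous size", which matches the statement above that 3;0 and 3 are equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a C library function): copies digits to out,
 * inserting sep between groups counted leftward from the radix character.
 * sizes[] lists the group sizes in mon_grouping order. The last size
 * repeats, and a 0 entry means "keep using the previous size".
 * Returns 0 on success, -1 on bad arguments or insufficient space.
 */
int group_digits(const char *digits, const int *sizes, size_t nsizes,
                 const char *sep, char *out, size_t outsz)
{
    char tmp[128];                 /* filled from the right-hand end */
    size_t pos = sizeof(tmp);
    size_t len, seplen, idx = 0, n;
    int size, in_group = 0;

    assert(digits != NULL && sizes != NULL && sep != NULL && out != NULL);
    if (nsizes == 0 || sizes[0] <= 0)
        return -1;
    len = strlen(digits);
    seplen = strlen(sep);
    size = sizes[0];
    while (len > 0) {
        if (in_group == size) {    /* group complete: add a separator */
            if (pos < seplen)
                return -1;
            pos -= seplen;
            memcpy(tmp + pos, sep, seplen);
            if (idx + 1 < nsizes && sizes[idx + 1] > 0)
                size = sizes[++idx];
            in_group = 0;
        }
        if (pos == 0)
            return -1;
        tmp[--pos] = digits[--len];
        in_group++;
    }
    n = sizeof(tmp) - pos;
    if (n + 1 > outsz)
        return -1;
    memcpy(out, tmp + pos, n);
    out[n] = '\0';
    return 0;
}
```

With sizes 3;0 and a space as the separator, the digits 1234567 become 1 234 567; with sizes 3;2 they become 12 34 567. With an empty separator, as in the sample locale, the digits are copied unchanged.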
7.2.5 Defining the LC_NUMERIC Locale Category
The
LC_NUMERIC
section of the locale source
file defines the rules and symbols used to format numeric data.
You can use
the
localeconv( )
and
nl_langinfo( )
functions to access this formatting information.
Example 7-8
shows an
LC_NUMERIC
section.
Example 7-8: LC_NUMERIC Category Definition
LC_NUMERIC                     [1]
decimal_point   "<comma>"      [2]
thousands_sep   ""             [3]
grouping        3;0            [4]
END LC_NUMERIC                 [5]
[1] Category header
[2] Definition of radix character (decimal point)
[3] Definition of character used to separate groups of digits to the left of the radix character. In this locale, no default character is defined. Therefore, applications must supply this character, if needed.
[4] The size of each group of digits to the left of the radix character. The character defined by thousands_sep, if any, is inserted between the groups defined by grouping. You can vary the size of groups by specifying multiple digits separated by a semicolon (;). For example, 3;2 specifies that the first group to the left of the radix character contains three digits and all subsequent groups contain two digits. On Tru64 UNIX systems, 3;0 and 3 are equivalent; that is, all digits to the left of the radix character are grouped by threes.
[5] Category trailer
The preceding example shows all of the symbols you can define in the
LC_NUMERIC
section.
In place of any symbol definitions, you can
specify a
copy
statement between the section header and
trailer to include this section from another locale.
Refer to
locale
(4)
for detailed rules about symbol definitions.
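As a minimal sketch of how an application might read the LC_NUMERIC values through localeconv( ), the following C fragment returns the radix character and reports whether the current locale defines a default thousands separator. The helper names numeric_radix and has_default_thousands_sep are hypothetical.

```c
#include <assert.h>
#include <locale.h>

/* Returns the first byte of the current locale's radix character. */
char numeric_radix(void)
{
    const struct lconv *lc = localeconv();

    assert(lc != NULL && lc->decimal_point != NULL);
    return lc->decimal_point[0];
}

/* Returns 1 if the current locale defines a thousands separator. */
int has_default_thousands_sep(void)
{
    const struct lconv *lc = localeconv();

    assert(lc != NULL && lc->thousands_sep != NULL);
    return lc->thousands_sep[0] != '\0';
}
```

In the C locale, the radix character is a period and no thousands separator is defined. Under a locale built from the LC_NUMERIC section in the preceding example, the radix character would be the comma and, again, no default separator would be reported.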
7.2.6 Defining the LC_TIME Locale Category
The
LC_TIME
section of
a locale source file defines the interpretation of field descriptors supported
by the
date
command.
This section also affects the behavior
of the
strftime( )
,
wcsftime( )
,
strptime( )
, and
nl_langinfo( )
functions.
Example 7-9
shows some of the symbols
defined for the sample French locale.
Example 7-9: LC_TIME Category Definition
LC_TIME                                              [1]
abday   "<d><i><m>";\
        "<l><u><n>";\
        "<m><a><r>";\
        "<m><e><r>";\
        "<j><e><u>";\
        "<v><e><n>";\
        "<s><a><m>"                                  [2]
day     "<d><i><m><a><n><c><h><e>";\
        "<l><u><n><d><i>";\
        "<m><a><r><d><i>";\
        "<m><e><r><c><r><e><d><i>";\
        "<j><e><u><d><i>";\
        "<v><e><n><d><r><e><d><i>";\
        "<s><a><m><e><d><i>"                         [3]
abmon   "<j><a><n>";\
        "<f><e-acute><v>";\
        "<m><a><r>";\
        "<a><v><r>";\
        "<m><a><i>";\
        "<j><u><n>";\
        "<j><u><l>";\
        "<a><o><u-circumflex>";\
        "<s><e><p>";\
        "<o><c><t>";\
        "<n><o><v>";\
        "<d><e-acute><c>"                            [4]
mon     "<j><a><n><v><i><e><r>";\
        "<f><e-acute><v><r><i><e><r>";\
        "<m><a><r><s>";\
        "<a><v><r><i><l>";\
        "<m><a><i>";\
        "<j><u><i><n>";\
        "<j><u><i><l><l><e><t>";\
        "<a><o><u-circumflex><t>";\
        "<s><e><p><t><e><m><b><r><e>";\
        "<o><c><t><o><b><r><e>";\
        "<n><o><v><e><m><b><r><e>";\
        "<d><e-acute><c><e><m><b><r><e>"             [5]
# date/time format. The following designates this
# format: "%a %e %b %H:%M:%S %Z %Y"
d_t_fmt "<percent-sign><a><space><percent-sign><e>\
<space><percent-sign><b><space><percent-sign><H>\
<colon><percent-sign><M><colon><percent-sign><S>\
<space><percent-sign><Z><space><percent-sign><Y>"    [6]
        .
        .
        .
END LC_TIME                                          [7]
[1] Section header
[2] Abbreviated names for days of the week. Use the %a conversion specifier to include these strings in formats.
[3] Full names for days of the week. Use the %A conversion specifier to include these strings in formats.
[4] Abbreviated names for months of the year. Use the %b conversion specifier to include these strings in formats.
[5] Full names for months of the year. Use the %B conversion specifier to include these strings in formats.
[6] Format for combined date and time information. The format combines field descriptors as defined for the strftime( ) function. See strftime(3) for a complete list of field descriptors.
The specified format includes the field descriptors for the abbreviated day of the week (%a), the day of the month (%e), the abbreviated month name (%b), the hour on a 24-hour clock (%H), the number of minutes (%M), the number of seconds (%S), the time zone (%Z), and the full representation of the year (%Y). If the date were April 23, 1999 on the East coast of the United States, the format specified in this example would cause the date command to display ven 23 avr 13:43:05 EDT 1999.
[7] Section trailer
The preceding example includes only some of the symbol definitions that
are standard for the
LC_TIME
category.
The following definitions
are also standard:
d_fmt
Format for the date alone; corresponds to the
%x
field descriptor
t_fmt
Format for the time alone; corresponds to the
%X
field descriptor
am_pm
Strings that represent ante meridiem (before noon) and post meridiem (after noon) times; corresponds
to the
%p
field descriptor
For example, the definition for English would be:
am_pm "<A><M>";"<P><M>"
t_fmt_ampm
Format for the time according to the 12-hour clock; corresponds to the
%r
field descriptor
era
Definition of how years are counted and displayed for each era in the locale. This format is for countries that use a year-counting system other than the Gregorian calendar. Such countries often use both the Gregorian calendar and a local era system.
era_d_fmt
Format of the date alone in era notation; corresponds to the
%Ex
field descriptor
era_t_fmt
Format of the time alone in era notation; corresponds to the
%EX
field descriptor
era_d_t_fmt
Format of both date and time in era notation; corresponds to the
%Ec
field descriptor
alt_digits
Definition of alternative symbols for digits; corresponds to the
%O
field descriptor
This format is for countries that include alternative symbols in date strings.
As is true for other category sections, you can specify a
copy
statement to include all
LC_TIME
definitions
from another locale.
Note that Tru64 UNIX supports symbols and field
descriptors in addition to those described here.
Refer to
locale
(4)
for more complete
information.
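To show how the d_t_fmt field descriptors behave in a program, the following C sketch passes the same descriptor string directly to strftime( ). The function names format_date_time and example_time are hypothetical, and the sketch hard-codes the format rather than reading d_t_fmt from the locale, which an application would normally obtain through the %c field descriptor.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Formats tm with the field descriptors from the example's d_t_fmt
 * definition. Returns the number of bytes written, or 0 if buf is
 * too small (the strftime() convention).
 */
size_t format_date_time(char *buf, size_t bufsz, const struct tm *tm)
{
    assert(buf != NULL && tm != NULL);
    return strftime(buf, bufsz, "%a %e %b %H:%M:%S %Z %Y", tm);
}

/* Builds the broken-down time 13:43:05 on Friday, April 23, 1999. */
struct tm example_time(void)
{
    struct tm tm;

    memset(&tm, 0, sizeof(tm));
    tm.tm_year = 1999 - 1900;
    tm.tm_mon = 3;          /* months count from 0, so 3 is April */
    tm.tm_mday = 23;
    tm.tm_wday = 5;         /* days count from Sunday = 0, so 5 is Friday */
    tm.tm_hour = 13;
    tm.tm_min = 43;
    tm.tm_sec = 5;
    tm.tm_isdst = -1;       /* daylight saving status unknown */
    return tm;
}
```

In the C locale, the result begins with Fri 23 Apr 13:43:05 and ends with 1999; the time zone text between them depends on the system. Under the sample French locale, the day and month abbreviations would instead come from the abday and abmon definitions.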
7.3 Building Libraries to Convert Multibyte/Wide-Character Encodings
C library routines rely on a set of special interfaces to convert characters to and from data file encoding and wide-character encoding (internal process code). By default, the C library routines use interfaces that handle only single-byte characters. However, many of these routines are defined with entry points that permit use of alternative interfaces for handling multibyte characters. The interfaces that can be tailored to a locale's codeset are called methods.
Only locales with multibyte codesets must use methods. When a locale uses methods, there are some methods that the locale must supply and other methods that it can optionally supply. A method is required when the corresponding interface is converting characters between data formats and needs codeset-specific logic to do that operation correctly. A method is optional when the corresponding interface is working with data after it has been converted to wide-character format and can apply logic that is valid for both single-byte and multibyte characters.
Methods must be available on the system in a shareable library.
This
library and the functions that implement each method in the library are made
known to the
localedef
command through a
methods
file.
When the
localedef
command processes
the
methods
file along with the
charmap
and
locale
source files, the resulting locale includes
pointers to all methods that are supplied with the locale, along with pointers
to default implementations for optional methods that are not supplied with
the locale.
When you set the
LANG
variable to the newly
built locale and run a command or application, methods are used wherever they
have been enabled in the system software.
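The idea that a built locale holds pointers to supplied methods, plus pointers to default implementations for optional methods that were not supplied, can be pictured with the following C sketch. Everything in it is hypothetical: the method_table structure, the two method slots, and the function names are illustrative only and do not reflect the actual layout that the localedef command produces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical method signature and table; not the real locale layout. */
typedef int (*method_fn)(int value);

struct method_table {
    method_fn char_len;        /* stands in for a required method */
    method_fn display_width;   /* stands in for an optional method */
};

/* Example of a method a multibyte locale might supply. */
int two_byte_len(int lead_byte)
{
    return lead_byte >= 0x80 ? 2 : 1;
}

/* Default used when the optional method is not supplied. */
int default_display_width(int wc)
{
    (void)wc;
    return 1;
}

/*
 * Fills out with the supplied methods, falling back to the default for
 * an optional method that was not supplied. Returns -1 if the required
 * method is missing.
 */
int resolve_methods(const struct method_table *supplied,
                    struct method_table *out)
{
    assert(supplied != NULL && out != NULL);
    if (supplied->char_len == NULL)
        return -1;
    out->char_len = supplied->char_len;
    out->display_width = supplied->display_width != NULL
                             ? supplied->display_width
                             : default_display_width;
    return 0;
}
```

A caller that supplies only char_len ends up with a table whose display_width slot points to the default, so code that runs later never has to test for a missing optional method.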
7.3.1 Required Methods
If your locale uses methods, it must supply the following methods, without which it is impossible for C Library functions to convert data between multibyte and wide-character formats:
_ _mbstopcs
_ _mbtopc
_ _pcstombs
_ _pctomb
mblen
mbstowcs
mbtowc
wcstombs
wctomb
wcswidth
wcwidth
7.3.1.1 Writing the _ _mbstopcs Method for the fgetws Function
The
fgetws( )
function
uses the
_ _mbstopcs
method to convert the bytes
in the standard I/O (stdio) buffer to a wide-character
string.
The function that implements this method must return the number of
wide characters converted by the call.
This method is similar to the one for
mbstowcs
(see
Section 7.3.1.6) but contains additional parameters to meet
the needs of
fgetws( )
.
By convention, a C source
file for this method has the file name
_ _mbstopcs_codeset
.c
, where
codeset
identifies the codeset for which the method is tailored.
Example 7-10
shows the file
_ _mbstopcs_sdeckanji.c
that defines the
_ _mbstopcs
method used
with the
ja_JP.sdeckanji
locale.
Example 7-10: The _ _mbstopcs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                          [1]
#include <wchar.h>                                           [1]
#include <sys/localedef.h>                                   [1]

int __mbstopcs_sdeckanji(
        wchar_t *pwcs,                                       [2]
        size_t pwcs_len,                                     [3]
        const char *s,                                       [4]
        size_t s_len,                                        [5]
        int stopchr,                                         [6]
        char **endptr,                                       [7]
        int *err,                                            [8]
        _LC_charmap_t *handle )                              [9]
{
    int cnt = 0;                                             [10]
    int pwcs_cnt = 0;                                        [10]
    int s_cnt = 0;                                           [10]

    *err = 0;                                                [11]

    while (1) {                                              [12]
        if (pwcs_cnt >= pwcs_len || s_cnt >= s_len) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        }                                                    [13]
        if ((cnt = __mbtopc_sdeckanji(&(pwcs[pwcs_cnt]),
                &(s[s_cnt]), (s_len - s_cnt), err, handle)) == 0) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        }                                                    [14]
        pwcs_cnt++;                                          [15]
        if (s[s_cnt] == (char) stopchr) {
            *endptr = (char *)&(s[s_cnt+1]);
            break;
        }                                                    [16]
        s_cnt += cnt;                                        [17]
    }                                                        [18]
    return (pwcs_cnt);                                       [19]
}
[1] Include header files that contain constants and structures required for this method.
[2] Points, through pwcs, to a buffer that stores the wide-character string.
[3] Defines a variable, pwcs_len, to store the size of the pwcs buffer.
[4] Points, through s, to a buffer that stores the multibyte-character string being converted.
[5] Defines a variable, s_len, to store the number of bytes of data in the s buffer. This parameter is needed because the fgetws( ) function reads from the standard I/O buffer, which does not contain null-terminated strings.
[6] Defines a variable, stopchr, to contain a byte value that would force conversion to stop. This value, typically \n, is passed to the method on the call from the fgetws( ) function, which handles only one line of input per call.
[7] Defines a variable, endptr, that points to the byte following the last byte converted. This pointer is needed to specify the starting character in the standard I/O buffer for the next call to fgetws( ).
[8] Points, through err, to a variable that stores execution status for the call made by this method to the _ _mbtopc method.
[9] Points, through handle, to a structure that points to the methods that parse character maps for this locale. The localedef command creates and stores values in the _LC_charmap_t structure.
[10] Initializes variables that indicate the number of bytes that a character uses in multibyte format (supplied by the _ _mbtopc method) and the byte or character position in buffers that the fgetws( ) function uses.
[11] Sets err to zero (0) to indicate success.
[12] Starts the while loop that converts the multibyte string.
[13] Sets endptr and breaks out of the loop when there is either no more space in the buffer that stores wide-character data or no more data in the buffer that stores multibyte data.
[14] Calls the _ _mbtopc method to convert a character from multibyte format to wide-character format. If the method fails to convert a character, breaks out of the loop and sets endptr to the first byte of the character that could not be converted.
The err variable contains the return status of the call to the _ _mbtopc method:
0 indicates success.
-1 indicates an invalid character.
A value greater than 0 indicates that too few bytes remain in the multibyte-character buffer to form a valid character. In this case, the value is the number of bytes required to form a valid character. The fgetws( ) function can then refill the buffer and try again.
[15] Increments the character position in the buffer that stores the wide-character data.
[16] Sets endptr to the byte following the stopchr character and breaks out of the loop if the stopchr character is encountered in the multibyte data.
[17] Increments the byte position in the buffer that contains multibyte data.
[18] Ends the while loop.
[19] Returns the number of characters in the buffer that contains wide-character data.
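The status protocol described for err (0 for success, -1 for an invalid character, and a positive byte count when the buffer ends in the middle of a character) can be tried out with a self-contained C sketch. The sketch deliberately uses a made-up two-byte codeset rather than sdeckanji, omits the stop character and the _LC_charmap_t handle, and uses the hypothetical names toy_mbtopc and toy_convert.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Toy converter for a made-up codeset (NOT sdeckanji): bytes below
 * 0x80 are single-byte characters; a byte of 0x80 or above starts a
 * two-byte character. Returns the bytes consumed, or 0 with *err set
 * to -1 (invalid) or to the number of bytes needed.
 */
static int toy_mbtopc(wchar_t *pwc, const unsigned char *s,
                      size_t maxlen, int *err)
{
    *err = 0;
    if (maxlen < 1) { *err = 1; return 0; }
    if (s[0] < 0x80) { *pwc = (wchar_t)s[0]; return 1; }
    if (maxlen < 2) { *err = 2; return 0; }
    if (s[1] < 0x80) { *err = -1; return 0; }
    *pwc = (wchar_t)(0x100 + (((s[0] - 0x80) << 7) | (s[1] - 0x80)));
    return 2;
}

/*
 * Converts as many whole characters as the buffers allow. On return,
 * *endptr marks where the next call should resume and *err reports
 * why conversion stopped.
 */
size_t toy_convert(wchar_t *out, size_t out_len, const unsigned char *in,
                   size_t in_len, const unsigned char **endptr, int *err)
{
    size_t n = 0, pos = 0;
    int cnt;

    assert(out != NULL && in != NULL && endptr != NULL && err != NULL);
    *err = 0;
    while (n < out_len && pos < in_len) {
        cnt = toy_mbtopc(&out[n], in + pos, in_len - pos, err);
        if (cnt == 0)
            break;          /* invalid or incomplete character */
        n++;
        pos += (size_t)cnt;
    }
    *endptr = in + pos;
    return n;
}
```

If the input buffer ends with the first byte of a two-byte character, toy_convert returns the count of characters converted so far, leaves endptr at the incomplete character, and sets err to 2. A caller can then refill its buffer starting from that byte and call again, which mirrors the retry behavior described for fgetws( ).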
7.3.1.2 Writing the _ _mbtopc Method for the getwc( ) Function
The
getwc( )
or
fgetwc( )
function calls the
_ _mbtopc
method to convert a multibyte character to a wide character.
The method returns
the number of bytes in the multibyte character that is converted.
This method
is similar to the one for
mbtowc
(see
Section 7.3.1.7)
but contains an additional parameter that
getwc( )
needs.
By convention, a C source file for this method has the file name
_ _mbtopc_codeset
.c
, where
codeset
identifies the codeset
for which this method is tailored.
Example 7-11
shows
the
_ _mbtopc_sdeckanji.c
file, which defines the
_ _mbtopc
method used with the
ja_JP.sdeckanji
locale.
Example 7-11: The _ _mbtopc_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                          [1]
#include <wchar.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  s[0] <= 0x9f:  PC = s[0]
  s[0] == 0x8e:  PC = s[1] + 0x5f;
  s[0] == 0x8f:  PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
  s[0] >= 0xa1:
      0xa1 <= s[1] <= 0xfe:  PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
      0x21 <= s[1] <= 0x7e:  PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d |   0x8e    | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 |   0x8f    | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                           [2]

int __mbtopc_sdeckanji(
        wchar_t *pwc,                                        [3]
        char *ts,                                            [4]
        size_t maxlen,                                       [5]
        int *err,                                            [6]
        _LC_charmap_t *handle )                              [7]
{
    wchar_t dummy;                                           [8]
    unsigned char *s = (unsigned char *)ts;                  [9]

    if (s == NULL)
        return(0);                                           [10]
    if (pwc == (wchar_t *)NULL)
        pwc = &dummy;                                        [11]
    *err = 0;                                                [12]

    if (s[0] <= 0x8d) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        } else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    }                                                        [13]
    else if (s[0] == 0x8e) {
        if (maxlen >= 2) {
            if (s[1] >= 0xa1 && s[1] <= 0xfe) {
                *pwc = (wchar_t) (s[1] + 0x5f);
                return(2);
            }
        } else {
            *err = 2;
            return(0);
        }
    }                                                        [14]
    else if (s[0] == 0x8f) {
        if (maxlen >= 3) {
            if ((s[1] >= 0xa1 && s[1] <= 0xfe) &&
                (s[2] >= 0xa1 && s[2] <= 0xfe)) {
                *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                       (wchar_t) (s[2] - 0xa1)) + 0x303c;
                return(3);
            }
        } else {
            *err = 3;
            return(0);
        }
    }                                                        [15]
    else if (s[0] <= 0x9f) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        } else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    }                                                        [16]
    else if (s[0] >= 0xa1 && s[0] <= 0xfe) {
        if (maxlen >= 2) {
            if (s[1] >= 0xa1 && s[1] <= 0xfe) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                       (wchar_t) (s[1] - 0xa1)) + 0x15e;
                return(2);
            } else if (s[1] >= 0x21 && s[1] <= 0x7e) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                       (wchar_t) (s[1] - 0x21)) + 0x5f1a;
                return(2);
            }
        } else {
            *err = 2;
            return(0);
        }
    }                                                        [17]
    *err = -1;
    return(0);                                               [18]
}
[1] Include header files that contain constants and structures required for this method
[2] Describes the algorithm used to determine the number of bytes and valid byte combinations for the different character sets that the codeset supports
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid.
[3] Points, through pwc, to a buffer that stores the wide character
[4] Points, through ts, to a buffer that stores the bytes that are passed to the method from the calling function
[5] Declares a variable, maxlen, that stores the maximum number of bytes in the multibyte data
This value is passed by the calling function.
[6] Points, through err, to a buffer that stores execution status
[7] Points, through handle, to a structure that contains pointers to the methods that parse the character maps for this locale
[8] Declares a variable, dummy, to which pwc can be set to ensure a valid address
[9] Casts ts (an array of signed characters) to s (an array of unsigned characters)
This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension to values when comparing a small signed data type, such as char, to a large signed data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in bits that remain when the value is converted to the larger data type for comparison. For example, without the cast, a byte with the value 0x8e would be sign-extended and treated as 0xffffff8e. In this case, a condition like the following is evaluated as true when you expect it to be false:
if (s[0] <= 0x8d)
[10] Returns zero (0) if s is a null pointer
[11] Points pwc to dummy if the calling function passed a null pointer for pwc
This operation ensures that pwc always points to a valid address; otherwise, the method could cause a segmentation fault when it stores a wide character through pwc.
[12] Initializes err to zero (0) to indicate success
[13] Determines if the character is one of the single-byte characters that the codeset defines for values equal to or less than 0x8d
If maxlen indicates that no bytes are available, returns zero (0) to indicate that no bytes were converted and sets err to 1 to indicate that 1 byte is needed to form a valid character. If the byte value is in the range being tested, moves the associated process code value to pwc and returns 1 to indicate the number of bytes converted.
[14] Determines if the character is one of the double-byte characters that the codeset defines for the value 0x8e (first byte) and the value range 0xa1 to 0xfe (second byte)
If yes, moves the associated process code value to the pwc buffer and returns 2 to indicate the number of bytes converted. If fewer than 2 bytes are available, returns 0 to indicate that no conversion took place and sets err to 2 to specify that at least 2 bytes are needed to form a valid character.
[15] Determines if the character is one of the triple-byte characters that the codeset defines for the value 0x8f (first byte), the range 0xa1 to 0xfe (second byte), and the range 0xa1 to 0xfe (third byte)
If yes, moves the associated process code value to pwc and returns 3 to indicate the number of bytes converted. If fewer than 3 bytes are available, sets err to 3 to indicate that at least 3 bytes are needed and returns zero (0) to indicate that no character was converted.
[16] Determines if the character is one of the single-byte characters that the codeset defines for the range 0x90 to 0x9f
If maxlen indicates that no bytes are available, returns zero (0) to indicate that no bytes were converted and sets err to 1 to indicate that at least 1 byte is needed to form a valid character. If the byte value is in the defined range, moves the associated process code value to pwc and returns 1 to indicate the number of bytes converted.
[17] Determines if the character is one of the double-byte characters that the codeset defines for the range 0xa1 to 0xfe (first byte) and either 0xa1 to 0xfe or 0x21 to 0x7e (second byte)
If yes, moves the associated process code value to pwc and returns 2 to indicate the number of bytes converted. If fewer than 2 bytes are available, sets err to 2 to indicate that at least 2 bytes are needed to form a valid character and returns zero (0) to indicate that no bytes were converted.
[18] Sets err to -1 to indicate that an invalid multibyte sequence was encountered and returns zero (0) to indicate that no bytes were converted
These statements execute if the multibyte data in s satisfies none of the preceding if conditions.
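The sign-extension problem that the cast to unsigned char avoids can be reproduced with a few lines of C. The function names below are hypothetical, and the sketch uses signed char explicitly because whether plain char is signed depends on the compiler.

```c
#include <assert.h>

/* Returns 1 if the byte, read as a signed char, compares <= 0x8d. */
int le_8d_signed(signed char c)
{
    return c <= 0x8d;       /* c is promoted to int with sign extension */
}

/* Returns 1 if the same byte, read as an unsigned char, compares <= 0x8d. */
int le_8d_unsigned(signed char c)
{
    unsigned char u = (unsigned char)c;   /* same bits, value 0..255 */

    return u <= 0x8d;
}

/* Self-check: -114 is the signed reading of the bit pattern 0x8e. */
void check_sign_extension(void)
{
    signed char byte = -114;

    assert(le_8d_signed(byte) == 1);      /* true, but not what was meant */
    assert(le_8d_unsigned(byte) == 0);    /* false, as intended */
}
```

The signed comparison succeeds because the promoted value is negative, so a test such as s[0] <= 0x8d would wrongly accept the lead byte 0x8e. The unsigned version compares 0x8e with 0x8d and correctly fails.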
7.3.1.3 Writing the _ _pcstombs Method for the fputws( ) Function
The
fputws( )
function
first calls the
_ _pcstombs
method to convert a
string of characters from process (wide-character) code to multibyte code.
If this method returns -1 to indicate no support by the locale,
fputws( )
then calls
putwc( )
for
each wide character in the string being converted.
By convention, a C source
file for this method has the file name
_ _pcstombs_codeset
.c
, where
codeset
identifies the codeset for which this method is tailored.
Example 7-12
shows the file
_ _pcstombs_sdeckanji.c
that defines the
_ _pcstombs
method used
with the
ja_JP.sdeckanji
locale.
Example 7-12: The _ _pcstombs_sdeckanji Method for the ja_JP.sdeckanji Locale
int __pcstombs_sdeckanji()
{
    return -1;                 [1]
}
[1] Returns -1 to indicate that the locale does not support the method. This return causes the fputws( ) function to use multiple calls to putwc( ) to convert wide characters in the string.
If you choose to implement this method fully rather than writing it to return -1, your function implementation returns the number of wide characters converted and must include header files and parameters as shown in the following example:
#include <stdlib.h>
#include <wchar.h>
#include <sys/localedef.h>

int __pcstombs_newcodeset(
        wchar_t *pcsbuf,                [1]
        size_t pcsbuf_len,              [2]
        char *mbsbuf,                   [3]
        size_t mbsbuf_len,              [4]
        char **endptr,                  [5]
        int *err,                       [6]
        _LC_charmap_t *handle )         [7]
[1] Specifies a pointer to a buffer that contains the wide-character string
[2] Specifies a variable with the length of the wide-character buffer
This value is passed to the method on the call from fputws( ).
[3] Specifies a pointer to a buffer that receives the converted multibyte-character string
[4] Specifies a variable with the length of the multibyte-character buffer
This value is passed to the method on the call from fputws( ).
[5] Points, through endptr, to a pointer to the byte position in the multibyte-character buffer where the next character would begin if multiple calls to fputws( ) are required to convert all the wide-character data
[6] Specifies a pointer to the execution status return
If this method calls the wctomb method to perform the character conversion, the wctomb method sets this status. Otherwise, this method must incorporate the logic to perform wide-character to multibyte-character conversion and set the status directly.
In any event, the fputws( ) function expects the following values:
0 for success
-1 to indicate that the wide-character value is invalid and therefore cannot be converted
A positive value to indicate that the multibyte-character buffer contains too few bytes after the last character to store the next character
In this case, the value is the number of bytes required to store the next character. The fputws( ) function can then empty the multibyte-character buffer and try again.
[7] Specifies a pointer to the _LC_charmap_t structure that stores pointers to the methods used with this locale
The
_ _pcstombs
method performs the reverse
of the operation that the
_ _mbstopcs
method described
in
Section 7.3.1.1
performs.
Because of the direction
of the data conversion, the
_ _pcstombs
method:
Does not require a variable for a stop conversion character,
such as
\n
Calls (or implements the operation performed by) the
wctomb
method, rather than the
_ _mbtopc
method, to convert each character and determine the number of bytes it needs
in the multibyte-character buffer
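A full implementation along the lines described above might be structured as in the following C sketch. It is an assumption-laden illustration, not the vendor's code: the name pcstombs_sketch is hypothetical, the standard wctomb( ) function stands in for a codeset-specific wctomb method, and the _LC_charmap_t handle parameter is omitted so that the fragment compiles on any system.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Sketch only: converts up to pcsbuf_len wide characters into mbsbuf
 * using the standard wctomb() in place of a codeset-specific method.
 * Returns the number of wide characters converted. On return, *endptr
 * points just past the last byte stored and *err holds 0, -1, or the
 * number of bytes needed for the character that did not fit.
 */
int pcstombs_sketch(const wchar_t *pcsbuf, size_t pcsbuf_len,
                    char *mbsbuf, size_t mbsbuf_len,
                    char **endptr, int *err)
{
    char tmp[MB_LEN_MAX];
    size_t used = 0;
    size_t i;

    assert(pcsbuf != NULL && mbsbuf != NULL && endptr != NULL && err != NULL);
    *err = 0;
    for (i = 0; i < pcsbuf_len; i++) {
        int n = wctomb(tmp, pcsbuf[i]);

        if (n < 0) {                 /* wide character cannot be converted */
            *err = -1;
            break;
        }
        if ((size_t)n > mbsbuf_len - used) {
            *err = n;                /* bytes needed for the next character */
            break;
        }
        memcpy(mbsbuf + used, tmp, (size_t)n);
        used += (size_t)n;
    }
    *endptr = mbsbuf + used;
    return (int)i;
}
```

Converting each character into a temporary buffer first lets the function check the remaining space before it writes to the output buffer, so a character is never stored partially.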
7.3.1.4 Writing a _ _pctomb Method
C
Library functions currently do not use the
_ _pctomb
interface.
The
putwc( )
function, for example, calls
the
wctomb
method to convert a character from wide-character
to multibyte-character format.
Nonetheless, the
localedef
command requires a method for this function when your locale supplies methods.
By convention, a C source file for this method has the file name
_ _pctomb_codeset
.c
, where
codeset
identifies the codeset
for which this method is tailored.
Example 7-13
shows
the
_ _pctomb_sdeckanji.c
file that defines the
_ _pctomb
method used with the
ja_JP.sdeckanji
locale.
Example 7-13: The _ _pctomb_sdeckanji Method for the ja_JP.sdeckanji Locale
int __pctomb_sdeckanji()
{
    return -1;                 [1]
}
[1] Returns -1 to indicate that the locale does not support this method
7.3.1.5 Writing a Method for the mblen( ) Function
The
mblen( )
function
uses the
mblen
method to return the number of bytes in
a multibyte character.
By convention, a C source file for this method has
the file name
_ _mblen_codeset
.c
, where
codeset
identifies the codeset
for which this method is tailored.
Example 7-14
shows
the
_ _mblen_sdeckanji.c
file that defines the
mblen
method used with the
ja_JP.sdeckanji
locale.
Example 7-14: The _ _mblen_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                          [1]
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  s[0] <= 0x9f:  1 byte
  s[0] == 0x8e:  2 bytes
  s[0] == 0x8f:  3 bytes
  s[0] >= 0xa1:  2 bytes

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d |   0x8e    | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 |   0x8f    | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                           [2]

int __mblen_sdeckanji(
        char *fs,                                            [3]
        size_t maxlen,                                       [4]
        _LC_charmap_t *handle )                              [5]
{
    const unsigned char *s = (void *) fs;                    [6]

    if (s == NULL || *s == '\0')
        return(0);                                           [7]
    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    }                                                        [8]
    if (s[0] <= 0x8d)
        return(1);                                           [9]
    else if (s[0] == 0x8e) {
        if (maxlen >= 2 && s[1] >= 0xa1 && s[1] <= 0xfe)
            return(2);
    }                                                        [10]
    else if (s[0] == 0x8f) {
        if (maxlen >= 3 && (s[1] >= 0xa1 && s[1] <= 0xfe) &&
            (s[2] >= 0xa1 && s[2] <= 0xfe))
            return(3);
    }                                                        [11]
    else if (s[0] <= 0x9f)
        return(1);                                           [12]
    else if (s[0] >= 0xa1) {
        if (maxlen >= 2 && (s[0] <= 0xfe))
            if ((s[1] >= 0xa1 && s[1] <= 0xfe) ||
                (s[1] >= 0x21 && s[1] <= 0x7e))
                return(2);
    }                                                        [13]
    _Seterrno(EILSEQ);
    return((size_t)-1);                                      [14]
}
Includes header files that contain constants and structures required by this method [Return to example]
Describes the algorithm used to determine the number of bytes in the character and whether it is a valid byte sequence
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid. [Return to example]
Points, through
fs
, to a buffer that stores
the byte string to be examined [Return to example]
Defines a variable,
maxlen
, that stores
the maximum length of a multibyte character
This value is passed to the method by the
mblen( )
function. [Return to example]
Points, through
handle
, to a structure that
stores pointers to the methods that parse character maps for this locale
[Return to example]
Casts
fs
(an array of signed characters)
to
s
(an array of unsigned characters).
This operation prevents problems when integer values are stored in the
array and then referenced by index.
Compilers apply sign extension to values
when comparing a small signed data type, such as
char
,
to a large signed data type, such as
int
.
Sign extension
means that the high bit of the value in the small data type is used to fill
in bits that remain when the value is converted to the larger data type for
comparison.
For example, if
s[0]
is the value 0x8e, sign
extension would cause it to be treated as 0xffffff8e.
In this case, a condition
like the following is evaluated as true when you expect it to be false:
if (s[0] <= 0x8d
[Return to example]
Returns zero (0) to indicate that the character length is zero
(0) bytes if
s
contains or points to NULL [Return to example]
Returns -1 and sets
errno
to
[EILSEQ]
(invalid character sequence) if
maxlen
(the maximum number of bytes to consider) is 0 or a negative
number
To set
errno
in a way that works correctly with multithreaded applications,
use
_Seterrno
rather than an assignment statement.
[Return to example]
Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x8d
If yes, returns 1 to indicate that the character length is 1 byte. [Return to example]
Determines if the first byte identifies a double-byte character whose first byte contains the value 0x8e and second byte contains a value in the range 0xa1 to 0xfe
If yes, returns 2 to indicate that the character length is 2 bytes. [Return to example]
Determines if the first byte identifies a triple-byte character whose first byte contains the value 0x8f and whose second and third bytes contain a value in the range 0xa1 to 0xfe
If yes, returns 3 to indicate that the character length is 3 bytes. [Return to example]
Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x9f
If yes, returns 1 to indicate that the character length is 1 byte. [Return to example]
Determines if the first byte identifies a double-byte character whose first byte contains a value in the range 0xa1 to 0xfe and whose second byte contains a value in the range 0x21 to 0x7e
If yes, returns 2 to indicate that the character length is 2 bytes. [Return to example]
Returns -1 and sets
errno
to
[EILSEQ]
to indicate an invalid multibyte sequence
These statements execute if the multibyte data in the standard I/O buffer
satisfies none of the preceding
if
conditions. [Return to example]
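The length rules in the preceding callouts can be sketched as one standalone function. This is a minimal illustration, not the __mblen_sdeckanji method itself: the name sdeckanji_char_len is hypothetical, the sketch does not set errno, and it assumes that a lead byte in the range 0xa1 to 0xfe may be followed by a second byte in either 0xa1 to 0xfe or 0x21 to 0x7e, as in the codeset table of Example 7-16.

```c
#include <stddef.h>

/* Hypothetical helper (not the library method). Returns the byte length
   of the character at fs, 0 for a null pointer, or -1 for an invalid or
   truncated sequence. */
int sdeckanji_char_len(const char *fs, size_t maxlen)
{
    /* View the bytes as unsigned so that 0x8e compares as 142 and not
       as a negative value produced by sign extension. */
    const unsigned char *s = (const unsigned char *)fs;

    if (s == NULL)
        return 0;
    if (maxlen < 1)
        return -1;

    if (s[0] <= 0x8d)                       /* single-byte character */
        return 1;
    if (s[0] == 0x8e)                       /* double-byte character */
        return (maxlen >= 2 && s[1] >= 0xa1 && s[1] <= 0xfe) ? 2 : -1;
    if (s[0] == 0x8f)                       /* triple-byte character */
        return (maxlen >= 3 && s[1] >= 0xa1 && s[1] <= 0xfe
                            && s[2] >= 0xa1 && s[2] <= 0xfe) ? 3 : -1;
    if (s[0] <= 0x9f)                       /* single-byte character */
        return 1;
    if (s[0] >= 0xa1 && s[0] <= 0xfe && maxlen >= 2) {
        /* Assumption: both second-byte ranges are accepted. */
        if ((s[1] >= 0xa1 && s[1] <= 0xfe) ||
            (s[1] >= 0x21 && s[1] <= 0x7e))
            return 2;
    }
    return -1;                              /* invalid sequence */
}
```

With this sketch, the sequence 0x8e 0xa1 yields 2 when maxlen is 2 and -1 when maxlen is 1, because the second byte cannot be examined.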
The mbstowcs( ) function uses the mbstowcs method to convert a multibyte character string to process wide-character code and to return the number of resultant wide characters.
By convention, a C source file for this method has the file name __mbstowcs_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 7-15 shows the __mbstowcs_sdeckanji.c file that defines the mbstowcs method used with the ja_JP.sdeckanji locale.
Example 7-15: The __mbstowcs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                        [1]
#include <wchar.h>
#include <sys/localedef.h>

size_t __mbstowcs_sdeckanji(
    wchar_t *pwcs,                                         [2]
    const char *s,                                         [3]
    size_t n,                                              [4]
    _LC_charmap_t *handle )                                [5]
{
    int len = n;                                           [6]
    int rc;                                                [7]
    int cnt;                                               [8]
    wchar_t *pwcs0 = pwcs;                                 [9]
    int mb_cur_max;                                        [10]

    if (s == NULL)
        return (0);                                        [11]

    mb_cur_max = MB_CUR_MAX;                               [12]

    if (pwcs == (wchar_t *)NULL) {
        cnt = 0;
        while (*s != '\0') {
            if ((rc = __mblen_sdeckanji(s, mb_cur_max, handle)) == -1)
                return(-1);
            cnt++;
            s += rc;
        }
        return(cnt);
    }                                                      [13]

    while (len-- > 0) {
        if ( *s == '\0') {
            *pwcs = (wchar_t) '\0';
            return (pwcs - pwcs0);
        }
        if ((cnt = __mbtowc_sdeckanji(pwcs, s, mb_cur_max, handle)) < 0)
            return(-1);
        s += cnt;
        ++pwcs;
    }                                                      [14]

    return (n);                                            [15]
}
Includes header files that contain constants and structures required for this method [Return to example]
Points, through
pwcs
, to a buffer that contains
the wide-character string [Return to example]
Points, through
s
, to a buffer that contains
the multibyte-character string [Return to example]
Defines a variable, n, that contains the maximum number of wide characters that can be stored in pwcs [Return to example]
Points, through
handle
, to a structure that
stores pointers to the methods that parse character maps for this locale
[Return to example]
Assigns the number of wide characters in the
pwcs
buffer (the
n
value supplied by the calling
function) to
len
[Return to example]
Defines a variable,
rc
, that stores the
return count from a call this method makes to the
mblen
function [Return to example]
Defines a variable,
cnt
, that counts the
bytes used by characters in the
s
buffer [Return to example]
Saves the start of the wide-character string passed by the
calling function in the
pwcs0
variable [Return to example]
Defines a variable,
mb_cur_max
, that is
later set to
MB_CUR_MAX
and used in a call to the
mblen
method [Return to example]
Returns zero (0) if
s
is null
A method should return zero (0) if the locale's character encoding is stateless and a nonzero value if the locale's character encoding is stateful. [Return to example]
Assigns the value defined for
MB_CUR_MAX
to
mb_cur_max
for use on the following call to the
mblen
method [Return to example]
Checks to see if a null pointer was passed from the calling function and, if yes, calls the mblen method to calculate the size of the wide-character string
The programmer can request the size of the pwcs buffer (for memory allocation purposes) by passing a null pointer as the pwcs parameter in the call to mbstowcs( ).
The programmer can then use the return value, which does not count the null terminator, to allocate memory for the application's wide-character buffer before calling mbstowcs( ) again to actually convert the multibyte string.
[Return to example]
Converts bytes in the multibyte-character buffer by calling
the
_ _mbtowc
method until a null character (end-of-string)
is encountered
Stops processing and returns the number of wide characters in the
pwcs
buffer if a NULL character is encountered; increments the byte
position in the multibyte character buffer by an appropriate number each time
a character is successfully converted
This
while
loop uses the condition
len-- > 0
to ensure that processing stops when the
pwcs
buffer is full.
The first
if
condition in the loop makes
sure that, if the multibyte string in the
s
buffer is null
terminated, the associated null terminator in the
pwcs
buffer is not included in the wide-character count that the
mbstowcs( )
function returns to the application. [Return to example]
Returns the value in
n
to indicate the resultant
number of wide characters in the
pwcs
buffer
This statement executes if the
pwcs
buffer runs out
of space before a NULL is encountered in the
s
buffer.
[Return to example]
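The sizing behavior described for the null pwcs case is what lets an application convert in two passes. The following sketch uses only the standard mbstowcs( ) interface; the helper name mb_to_wide is hypothetical and the error handling is deliberately minimal.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts the multibyte string s to a newly
   allocated wide-character string. Returns NULL on an invalid sequence
   or allocation failure; the caller frees the result. */
wchar_t *mb_to_wide(const char *s)
{
    size_t n;
    wchar_t *w;

    /* Pass 1: a null destination requests only the character count. */
    n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    /* The count excludes the terminator, so allocate one extra element. */
    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Pass 2: convert, leaving room for the terminating null. */
    if (mbstowcs(w, s, n + 1) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}
```

Allocating n + 1 elements matters because, as the last callout notes, the method returns n without storing a terminator when the destination buffer fills first.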
The mbtowc( ) function uses the mbtowc method to convert a multibyte character to a wide character and to return the number of bytes in the multibyte character that was converted.
By convention, a C source file for this method has the file name __mbtowc_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 7-16 shows the __mbtowc_sdeckanji.c file that defines the mbtowc method used with the ja_JP.sdeckanji locale.
Example 7-16: The __mbtowc_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                        [1]
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  s[0] <= 0x9f:   PC = s[0]
  s[0] =  0x8e:   PC = s[1] + 0x5f;
  s[0] =  0x8f    PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
  s[0] >= 0xa1:   0xa1 <= s[1] <= 0xfe
                  PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
                  0x21 <= s[1] <= 0x7e
                  PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                         [2]

int __mbtowc_sdeckanji(
    wchar_t *pwc,                                          [3]
    const char *ts,                                        [4]
    size_t maxlen,                                         [5]
    _LC_charmap_t *handle )                                [6]
{
    unsigned char *s = (unsigned char *)ts;                [7]
    wchar_t dummy;                                         [8]

    if (s == NULL)
        return(0);                                         [9]

    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    }                                                      [10]

    if (pwc == (wchar_t *)NULL)
        pwc = &dummy;                                      [11]

    if (s[0] <= 0x8d) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    }                                                      [12]
    else if (s[0] == 0x8e) {
        if ( (maxlen >= 2) && ((s[1] >=0xa1) && (s[1] <=0xfe))) {
            *pwc = (wchar_t) (s[1] + 0x5f); /* 0x100 - 0xa1 */
            return(2);
        }
    }                                                      [13]
    else if (s[0] == 0x8f) {
        if((maxlen >= 3) && (((s[1] >=0xa1) && (s[1] <=0xfe)) &&
                             ((s[2] >=0xa1) && (s[2] <= 0xfe)))) {
            *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                   (wchar_t) (s[2] - 0xa1)) + 0x303c;
            return(3);
        }
    }                                                      [14]
    else if (s[0] <= 0x9f) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    }                                                      [15]
    else if (((s[0] >= 0xa1) && (s[0] <= 0xfe)) && (maxlen >= 2)){
        if (((s[1] >=0xa1) && (s[1] <= 0xfe))){
            *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                   (wchar_t)(s[1] - 0xa1)) + 0x15e;
            return(2);
        }
        else if (((s[1] >=0x21) && (s[1] <= 0x7e))){
            *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                   (wchar_t)(s[1] - 0x21)) + 0x5f1a;
            return(2);
        }
    }                                                      [16]

    _Seterrno(EILSEQ);
    return(-1);                                            [17]
}
Includes header files that contain constants and structures required for this method [Return to example]
Describes the algorithm used to determine the number of bytes in the character and whether it is a valid byte sequence
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid. [Return to example]
Points, through
pwc
, to a buffer that contains
the wide character [Return to example]
Points, through
ts
, to a buffer that contains
values in multibyte-character format [Return to example]
Defines a variable,
maxlen
, that stores
the maximum length of a multibyte character
This value is passed from the calling function; the value will have
been set to
MB_CUR_MAX
on the original call made by the
application programmer. [Return to example]
Points, through
handle
, to a structure that
stores pointers to the methods that parse character maps for this locale
[Return to example]
Casts
ts
(an array of signed characters)
to
s
(an array of unsigned characters)
This operation prevents problems when integer values are stored in the
array and then referenced by index.
Compilers apply sign extension to values
when comparing a small signed data type, such as
char
,
to a large signed data type, such as
int
.
Sign extension
means that the high bit of the value in the small data type is used to fill
in bits that remain when the value is converted to the larger data type for
comparison.
For example, if the first byte contains the value 0x8e and is read through the signed ts array, sign extension causes it to be treated as 0xffffff8e.
In this case, a condition like the following one, written against the uncast array, would be evaluated as true when you would expect it to be false:
if (ts[0] <= 0x8d)
[Return to example]
Defines a variable,
dummy
, that can be assigned
to
pwc
to ensure
pwc
points to a valid
address [Return to example]
Returns zero (0) to indicate that the locale's character encoding
is stateless if
s
contains or points to NULL
If passed a null pointer, this method should return a value to indicate whether the locale's character encoding is stateful or stateless. Return a nonzero value if your locale's character encoding is stateful. [Return to example]
Returns -1 cast to
size_t
and sets
errno
to
[EILSEQ]
(invalid byte
sequence) if the multibyte data buffer is less than 1 byte in length
[Return to example]
Points pwc to dummy if the calling function passed a null pointer as the pwc parameter
This operation ensures that pwc always points to a valid address; otherwise, the method could produce a segmentation fault when it stores a wide character through pwc. [Return to example]
Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x8d
If yes, stores the associated process code value in the
pwc
buffer and returns 1 to indicate that the character length is 1
byte [Return to example]
Determines if the first byte identifies a double-byte character whose first byte contains the value 0x8e and second byte contains a value in the range 0xa1 to 0xfe
If yes, stores the associated process code value in the
pwc
buffer and returns 2 to indicate that the character length is 2
bytes [Return to example]
Determines if the first byte identifies a triple-byte character whose first byte contains the value 0x8f and whose second and third bytes contain a value in the range 0xa1 to 0xfe
If yes, stores the associated process code value in the
pwc
buffer and returns 3 to indicate that the character length is 3
bytes [Return to example]
Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x9f
If yes, stores the associated process code value in the
pwc
buffer and returns 1 to indicate that the character length is 1
byte [Return to example]
Determines if the first byte identifies a double-byte character whose first byte contains a value in the range 0xa1 to 0xfe and whose second byte contains a value in the range 0xa1 to 0xfe or in the range 0x21 to 0x7e
If yes, stores the associated process code value in the
pwc
buffer and returns 2 to indicate that the character length is 2
bytes [Return to example]
Returns -1 and sets
errno
to
[EILSEQ]
to indicate that an invalid multibyte sequence
was encountered
These statements execute if the multibyte data in the
s
buffer satisfies none of the preceding
if
conditions.
[Return to example]
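The arithmetic in the algorithm comment of Example 7-16 is easier to check when it is isolated from the locale interfaces. The following sketch restates it as a standalone function; the name sdeckanji_decode and the use of unsigned long for the process code are assumptions made for illustration, and the sketch does not set errno.

```c
#include <stddef.h>

/* Hypothetical helper that mirrors the conversion table in Example 7-16.
   Stores the process code in *pc and returns the number of bytes used,
   or -1 if the sequence is invalid or longer than maxlen. */
int sdeckanji_decode(const unsigned char *s, size_t maxlen, unsigned long *pc)
{
    if (maxlen < 1)
        return -1;

    if (s[0] <= 0x9f && s[0] != 0x8e && s[0] != 0x8f) {
        *pc = s[0];                         /* single byte maps to itself */
        return 1;
    }
    if (s[0] == 0x8e) {                     /* JIS X0201 right half */
        if (maxlen >= 2 && s[1] >= 0xa1 && s[1] <= 0xfe) {
            *pc = (unsigned long)s[1] + 0x5f;
            return 2;
        }
        return -1;
    }
    if (s[0] == 0x8f) {                     /* JIS X0212, 3 bytes */
        if (maxlen >= 3 && s[1] >= 0xa1 && s[1] <= 0xfe
                        && s[2] >= 0xa1 && s[2] <= 0xfe) {
            *pc = (((unsigned long)(s[1] - 0xa1) << 7) |
                   (unsigned long)(s[2] - 0xa1)) + 0x303c;
            return 3;
        }
        return -1;
    }
    if (s[0] >= 0xa1 && s[0] <= 0xfe && maxlen >= 2) {
        unsigned long row = (unsigned long)(s[0] - 0xa1) << 7;

        if (s[1] >= 0xa1 && s[1] <= 0xfe) { /* JIS X0208 */
            *pc = (row | (unsigned long)(s[1] - 0xa1)) + 0x15e;
            return 2;
        }
        if (s[1] >= 0x21 && s[1] <= 0x7e) { /* user-defined characters */
            *pc = (row | (unsigned long)(s[1] - 0x21)) + 0x5f1a;
            return 2;
        }
    }
    return -1;
}
```

For example, the bytes 0x8e 0xa1 decode to process code 0x0100, and 0xfe 0xfe decode to 0x303b, which are the first value of the JIS X0201 range and the last value of the JIS X0208 range in the table.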
The wcstombs( ) function calls the wcstombs method to convert a wide-character string to a multibyte-character string and to return the number of bytes in the resultant multibyte-character string.
By convention, a C source file for this method has the file name __wcstombs_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 7-17 shows the __wcstombs_sdeckanji.c file that defines the wcstombs method used with the ja_JP.sdeckanji locale.
Example 7-17: The __wcstombs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                        [1]
#include <wchar.h>
#include <limits.h>
#include <sys/localedef.h>

size_t __wcstombs_sdeckanji(
    char *s,                                               [2]
    const wchar_t *pwcs,                                   [3]
    size_t n,                                              [4]
    _LC_charmap_t *handle )                                [5]
{
    int cnt=0;                                             [6]
    int len=0;                                             [7]
    int i=0;                                               [8]
    char tmps[MB_LEN_MAX+1];                               [9]

    if ( s == (char *)NULL) {
        cnt = 0;
        while (*pwcs != (wchar_t)'\0') {
            if ((len = __wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
                return(-1);
            cnt += len;
            pwcs++;
        }
        return(cnt);
    }                                                      [10]

    if (*pwcs == (wchar_t)'\0') {
        *s = '\0';
        return(0);
    }                                                      [11]

    while (1) {                                            [12]
        if ((len = __wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
            return(-1);                                    [13]
        else if (cnt+len > n) {
            *s = '\0';
            break;
        }                                                  [14]
        if (tmps[0] == '\0') {
            *s = '\0';
            break;
        }                                                  [15]
        for (i=0; i<len; i++) {
            *s = tmps[i];
            s++;
        }                                                  [16]
        cnt += len;                                        [17]
        if (cnt == n)
            break;                                         [18]
        pwcs++;                                            [19]
    }                                                      [20]

    if (cnt == 0)
        cnt = len;                                         [21]
    return (cnt);                                          [22]
}
Includes header files that contain constants and structures required for this method [Return to example]
Points, through
s
, to a buffer that stores
the multibyte-character string that this method passes to the calling function
[Return to example]
Points, through
pwcs
, to a buffer that stores
the wide-character string that is being converted [Return to example]
Defines a variable, n, that stores the maximum number of bytes in the multibyte-character string buffer
This value is supplied by the calling function. [Return to example]
Points, through
handle,
to a structure that
points to the methods that parse character maps for this locale [Return to example]
Initializes a variable,
cnt
, that is incremented
by the number of bytes (len
) of each converted character
[Return to example]
Initializes a variable,
len
, that stores
the length of each converted character [Return to example]
Initializes a variable,
i
, that is used
to index the bytes in each multibyte character when moving a converted character
from temporary storage to
s
[Return to example]
Defines a temporary buffer,
tmps
, that stores
the multibyte character returned to this method from a call to the
wctomb
method [Return to example]
Checks to see if the calling function passed a null pointer as the s parameter
If yes, calls the wctomb method to calculate the number of bytes required for converted characters (excluding the null terminator) in the multibyte-character buffer
The programmer can request the size of the s buffer (for memory allocation purposes) by passing a null pointer as the s parameter on the call to wcstombs( ).
The programmer can then use the return value to allocate memory for the application's multibyte-character buffer before calling wcstombs( ) again to actually convert the wide-character string. [Return to example]
Stores a null terminator in s and returns zero (0), to indicate that no multibyte characters resulted, if the first wide character in pwcs is a null [Return to example]
Starts a
while
loop to process characters
in the wide-character string [Return to example]
Converts characters in the wide-character buffer by calling
the
wctomb
method; returns -1 to indicate an invalid
character if
wctomb
returns -1 [Return to example]
Terminates
s
with NULL and breaks out of
the
while
loop if there is no room in
s
for the character just converted by
wctomb
[Return to example]
Moves a null terminator to s and breaks out of the while loop when a null wide character is encountered in pwcs [Return to example]
Appends each byte in
tmps
to
s
if the current wide character is not a null [Return to example]
Increments
cnt
by the number of bytes (len
) occupied by this character in multibyte format [Return to example]
Breaks out of the
while
loop without adding
a null terminator if the number of bytes processed equals
n
(the maximum number of bytes in
s
) [Return to example]
Increments
pwcs
to point to the next wide
character to be converted [Return to example]
Ends the
while
loop that converts each wide
character [Return to example]
Ensures that zero (0) is returned if
s
does
not contain enough space for even one character [Return to example]
Returns the number of bytes in the resultant multibyte-character string [Return to example]
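Because the method stops without a terminator when the converted bytes exactly fill the buffer, callers that need a terminated string usually reserve the last byte themselves. The following sketch shows one way to do that with the standard wcstombs( ) interface; the helper name wide_to_mb_buf is hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts w into buf, which holds size bytes, and
   always null-terminates the result. Returns the number of bytes stored
   (excluding the terminator) or (size_t)-1 on error. */
size_t wide_to_mb_buf(char *buf, size_t size, const wchar_t *w)
{
    size_t n;

    if (size == 0)
        return (size_t)-1;

    /* Offer only size - 1 bytes so one byte stays free for the null. */
    n = wcstombs(buf, w, size - 1);
    if (n == (size_t)-1)
        return n;

    /* If the conversion filled all size - 1 bytes, no terminator was
       written; storing one here is harmless in the other case. */
    buf[n] = '\0';
    return n;
}
```

A return value equal to size - 1 suggests that the output may have been truncated, so a caller that cannot tolerate truncation can retry with a larger buffer.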
The wctomb( ) function calls the wctomb method to convert a wide character to a multibyte character and to return the number of bytes in the resultant multibyte character.
By convention, a C source file for this method has the file name __wctomb_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 7-18 shows the __wctomb_sdeckanji.c file that defines the wctomb method for the ja_JP.sdeckanji locale.
Example 7-18: The __wctomb_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                        [1]
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  PC <= 0x009f:                  s[0] = PC
  PC >= 0x0100 and PC <=0x015d:  s[0] = 0x8e
                                 s[1] = PC - 0x005f
  PC >= 0x015e and PC <=0x303b:  s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                                 s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
  PC >= 0x303c and PC <=0x5f19:  s[0] = 0x8f
                                 s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                                 s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
  PC >= 0x5f1a and PC <=0x8df7   s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                                 s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                         [2]

int __wctomb_sdeckanji(
    char *s,                                               [3]
    wchar_t wc,                                            [4]
    _LC_charmap_t *handle )                                [5]
{
    if (s == (char *)NULL)
        return(0);                                         [6]

    if (wc <= 0x9f) {
        s[0] = (char) wc;
        return(1);
    }                                                      [7]
    else if ((wc >= 0x0100) && (wc <= 0x015d)) {
        s[0] = 0x8e;
        s[1] = wc - 0x5f;
        return(2);
    }                                                      [8]
    else if ((wc >=0x015e) && (wc <= 0x303b)) {
        s[0] = (char) (((wc - 0x015e) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x015e) & 0x007f) + 0x00a1);
        return(2);
    }                                                      [9]
    else if ((wc >=0x303c) && (wc <= 0x5f19)) {
        s[0] = 0x8f;
        s[1] = (char) (((wc - 0x303c) >> 7) + 0x00a1);
        s[2] = (char) (((wc - 0x303c) & 0x007f) + 0x00a1);
        return(3);
    }                                                      [10]
    else if ((wc >=0x5f1a) && (wc <= 0x8df7)) {
        s[0] = (char) (((wc - 0x5f1a) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x5f1a) & 0x007f) + 0x0021);
        return(2);
    }                                                      [11]

    _Seterrno(EILSEQ);
    return(-1);                                            [12]
}
Includes header files that contain constants and structures required for this method [Return to example]
Describes the conversion algorithm that this method uses
Each character set supported by the codeset corresponds to a unique range of wide-character (process code) values and, within each character set, multibyte characters are of uniform length (1, 2, or 3 bytes). Therefore, the range in which each wide-character value falls indicates the number of bytes required for the character in multibyte format; the wide-character value itself determines the specific byte value or values for the character in multibyte format. [Return to example]
Points, through
s
, to a buffer that stores
the multibyte character [Return to example]
Defines the
wc
variable that stores the
wide character [Return to example]
Points, through
handle
, to a structure that
stores pointers to the methods that parse the character maps for this locale
[Return to example]
Returns zero (0) to indicate that no characters were converted
if
s
points to NULL [Return to example]
If the wide-character value is equal to or less than 0x9f,
moves that value into the first byte of the
s
array and
returns 1 to indicate that the converted character is 1 byte in length
[Return to example]
If the wide-character value is in the range 0x0100 to 0x015d,
moves the value 0x8e to the first byte and a calculated value to the second
byte of the
s
array; returns 2 to indicate that the converted
character is 2 bytes in length [Return to example]
If the wide-character value is in the range 0x015e to 0x303b,
moves calculated values to the first and second bytes of the
s
array and returns 2 to indicate that the converted character is 2 bytes in
length [Return to example]
If the wide-character value is in the range 0x303c to 0x5f19,
moves 0x8f to the first byte and calculated values to the second and third
bytes of the
s
array; returns 3 to indicate that the converted
character is 3 bytes in length [Return to example]
If the wide-character value is in the range 0x5f1a to 0x8df7,
moves calculated values to the first and second bytes of the
s
array, and returns 2 to indicate that the converted character is 2 bytes in
length [Return to example]
Sets
errno
to
[EILSEQ]
and returns -1 to indicate that the wide-character
value is invalid
These statements execute if the wide-character values satisfy none of the preceding conditions. [Return to example]
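The range-to-bytes mapping in the algorithm comment of Example 7-18 can be checked in isolation. The sketch below restates it as a standalone function; the name sdeckanji_encode and the unsigned long process-code type are assumptions for illustration, and the sketch does not set errno. Like the method in the example, it does not reject process codes whose low 7 bits exceed 0x5d, which fall in gaps between valid characters.

```c
/* Hypothetical helper that mirrors the ranges in Example 7-18. Stores up
   to 3 bytes in s and returns the number stored, or -1 if pc is outside
   every range. */
int sdeckanji_encode(unsigned long pc, unsigned char s[3])
{
    if (pc <= 0x9f) {                              /* single byte */
        s[0] = (unsigned char)pc;
        return 1;
    }
    if (pc >= 0x0100 && pc <= 0x015d) {            /* JIS X0201 right half */
        s[0] = 0x8e;
        s[1] = (unsigned char)(pc - 0x5f);
        return 2;
    }
    if (pc >= 0x015e && pc <= 0x303b) {            /* JIS X0208 */
        s[0] = (unsigned char)(((pc - 0x015e) >> 7) + 0xa1);
        s[1] = (unsigned char)(((pc - 0x015e) & 0x7f) + 0xa1);
        return 2;
    }
    if (pc >= 0x303c && pc <= 0x5f19) {            /* JIS X0212 */
        s[0] = 0x8f;
        s[1] = (unsigned char)(((pc - 0x303c) >> 7) + 0xa1);
        s[2] = (unsigned char)(((pc - 0x303c) & 0x7f) + 0xa1);
        return 3;
    }
    if (pc >= 0x5f1a && pc <= 0x8df7) {            /* user-defined characters */
        s[0] = (unsigned char)(((pc - 0x5f1a) >> 7) + 0xa1);
        s[1] = (unsigned char)(((pc - 0x5f1a) & 0x7f) + 0x21);
        return 2;
    }
    return -1;                                     /* includes 0xa0 to 0xff */
}
```

For example, process code 0x303c, the first value of the JIS X0212 range, encodes as the three bytes 0x8f 0xa1 0xa1.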
The wcswidth( ) function uses the wcswidth method to determine the number of columns required to display a wide-character string.
By convention, a C source file for this method has the file name __wcswidth_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 7-19 shows the __wcswidth_sdeckanji.c file that defines the wcswidth method used for the ja_JP.sdeckanji locale.
Example 7-19: The __wcswidth_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                        [1]
#include <wchar.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  PC <= 0x009f:                  s[0] = PC
  PC >= 0x0100 and PC <=0x015d:  s[0] = 0x8e
                                 s[1] = PC - 0x005f
  PC >= 0x015e and PC <=0x303b:  s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                                 s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
  PC >= 0x303c and PC <=0x5f19:  s[0] = 0x8f
                                 s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                                 s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
  PC >= 0x5f1a and PC <=0x8df7   s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                                 s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                         [2]

int __wcswidth_sdeckanji(
    const wchar_t *wcs,                                    [3]
    size_t n,                                              [4]
    _LC_charmap_t *hdl )                                   [5]
{
    int len;                                               [6]
    int i;                                                 [7]

    if (wcs == (wchar_t *)NULL || *wcs == (wchar_t)NULL)
        return(0);                                         [8]

    len = 0;                                               [9]
    for (i=0; wcs[i] != (wchar_t)NULL && i<n; i++) {       [10]
        if (wcs[i] <= 0x9f)
            len += 1;                                      [11]
        else if ((wcs[i] >= 0x0100) && (wcs[i] <= 0x015d))
            len += 1;                                      [12]
        else if ((wcs[i] >=0x015e) && (wcs[i] <= 0x303b))
            len += 2;                                      [13]
        else if ((wcs[i] >=0x303c) && (wcs[i] <= 0x5f19))
            len += 2;                                      [14]
        else if ((wcs[i] >=0x5f1a) && (wcs[i] <= 0x8df7))
            len += 2;                                      [15]
        else
            return(-1);                                    [16]
    }                                                      [17]
    return(len);                                           [18]
}
Includes header files that contain constants and structures required for this method [Return to example]
Describes the algorithm used to determine the required display width
Note that each character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns. [Return to example]
Points, through
wcs
, to a buffer that stores
the wide-character string for which display width information is requested
[Return to example]
Defines a variable, n, that stores the maximum number of wide characters to examine in the wcs buffer [Return to example]
Points, through
hdl
, to a structure that
stores pointers to the methods that parse character maps for this locale
[Return to example]
Defines a variable, len, that stores the display width in columns [Return to example]
Defines a variable,
i
, that functions as
a loop counter [Return to example]
Returns zero (0) if
wcs
contains or points
to NULL [Return to example]
Initializes
len
to zero (0) [Return to example]
Begins a
for
loop that processes each wide
character in the
wcs
buffer and increments the wide-character
pointer [Return to example]
Increments
len
by 1 if the value of the
current wide character is less than or equal to 0x9f [Return to example]
Increments
len
by 1 if the value of the
current wide character is in the range 0x0100 to 0x015d [Return to example]
Increments
len
by 2 if the value of the
current wide character is in the range 0x015e to 0x303b [Return to example]
Increments
len
by 2 if the value of the
current wide character is in the range 0x303c to 0x5f19 [Return to example]
Increments
len
by 2 if the value of the
current wide character is in the range 0x5f1a to 0x8df7 [Return to example]
Returns -1 to indicate that the string contains an invalid wide character
This statement executes if a value that satisfies none of the preceding
conditions is encountered in the string.
The calling function,
wcswidth( )
, also returns -1 if the wide character is nonprintable; however,
this condition is evaluated at the level of the calling function and does
not need to be evaluated by the method. [Return to example]
Ends the
for
loop that processes wide characters
in the
wcs
buffer [Return to example]
Returns
len
to indicate the number of columns
required to display the wide-character string [Return to example]
The wcwidth( ) function uses the wcwidth method to determine the number of columns required to display a wide character.
By convention, a C source file for this method has the file name __wcwidth_codeset.c, where codeset identifies the codeset for which this method is tailored.
Example 7-20 shows the __wcwidth_sdeckanji.c file that defines the wcwidth method used with the ja_JP.sdeckanji locale.
Example 7-20: The __wcwidth_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>                                        [1]
#include <wchar.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  PC <= 0x009f:                  s[0] = PC
  PC >= 0x0100 and PC <=0x015d:  s[0] = 0x8e
                                 s[1] = PC - 0x005f
  PC >= 0x015e and PC <=0x303b:  s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                                 s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
  PC >= 0x303c and PC <=0x5f19:  s[0] = 0x8f
                                 s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                                 s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
  PC >= 0x5f1a and PC <=0x8df7   s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                                 s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                         [2]

int __wcwidth_sdeckanji(
    wint_t wc,                                             [3]
    _LC_charmap_t *hdl )                                   [4]
{
    if (wc == 0)
        return(0);                                         [5]

    if (wc <= 0x9f)
        return(1);                                         [6]
    else if ((wc >= 0x0100) && (wc <= 0x015d))
        return(1);                                         [7]
    else if ((wc >=0x015e) && (wc <= 0x303b))
        return(2);                                         [8]
    else if ((wc >=0x303c) && (wc <= 0x5f19))
        return(2);                                         [9]
    else if ((wc >=0x5f1a) && (wc <= 0x8df7))
        return(2);                                         [10]

    return(-1);                                            [11]
}
Includes header files that contain constants and structures required for this method [Return to example]
Describes the algorithm used to determine the required display width
Note that a character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns. [Return to example]
Defines the
wc
variable that stores the
wide character for which display width information is requested [Return to example]
Points, through
hdl
, to a structure that
stores pointers to the methods that parse character maps for this locale
[Return to example]
Returns zero (0) if the wide character is the null character [Return to example]
Returns 1 if the wide-character value is less than or equal to 0x009f [Return to example]
Returns 1 if the wide-character value is in the range 0x0100 to 0x015d [Return to example]
Returns 2 if the wide-character value is in the range 0x015e to 0x303b [Return to example]
Returns 2 if the wide-character value is in the range 0x303c to 0x5f19 [Return to example]
Returns 2 if the wide-character value is in the range 0x5f1a to 0x8df7 [Return to example]
Returns -1 if the wide-character value is invalid
The calling function,
wcwidth( )
, also returns -1
if the wide character is nonprintable; however, this condition is evaluated
at the level of the calling function and does not need to be evaluated by
the method. [Return to example]
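The width rules used by the wcswidth and wcwidth methods reduce to a lookup by process-code range followed by a sum. The following sketch shows that relationship with two standalone functions; the names sdeckanji_width and sdeckanji_string_width, and the use of unsigned long for process codes, are assumptions for illustration. As in the methods, printability is not checked.

```c
#include <stddef.h>

/* Hypothetical helper: display columns for one process code, 0 for the
   null character, or -1 if the value is outside every valid range. */
int sdeckanji_width(unsigned long pc)
{
    if (pc == 0)
        return 0;
    if (pc <= 0x9f)
        return 1;                       /* single-byte characters */
    if (pc >= 0x0100 && pc <= 0x015d)
        return 1;                       /* JIS X0201 RH: 2 bytes, 1 column */
    if (pc >= 0x015e && pc <= 0x8df7)
        return 2;                       /* JIS X0208, JIS X0212, and UDC */
    return -1;
}

/* Hypothetical helper: sums the widths of at most n process codes,
   stopping at a null; returns -1 if any value is invalid. */
int sdeckanji_string_width(const unsigned long *pcs, size_t n)
{
    int total = 0;
    size_t i;

    for (i = 0; i < n && pcs[i] != 0; i++) {
        int w = sdeckanji_width(pcs[i]);

        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

The loop tests i < n before it reads pcs[i], so a buffer that holds exactly n values without a terminator is never read past its end.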
A locale can include methods in addition to those discussed in
Section 7.3.1.
If your locale uses methods but does not supply
any for the functions associated with particular locale categories or some
other locale-related functions, the
localedef
command applies
default methods that handle process code for both single-byte and multibyte
characters.
The following list names the optional methods:
LC_CTYPE category: towupper, towlower, wctype, iswctype
LC_COLLATE category: fnmatch, strcoll, strxfrm, wcscoll, wcsxfrm, regcomp, regexec, regfree, regerror
LC_MONETARY, LC_NUMERIC, or both categories: localeconv, strfmon
LC_TIME category: strftime, strptime, wcsftime
LC_MESSAGES category: rpmatch
Miscellaneous use: nl_langinfo
Writing optional methods requires detailed information about the internal
interfaces to C library routines.
This information is vendor proprietary and
may be subject to change.
In the rare cases where your locale must include
an optional method, contact your technical support representative to request
information.
7.3.3 Building a Shareable Library to Use with a Locale
Example 7-21
shows the compiler and linker command lines that are required to build the
method source files into a shareable library that is used with the
ja_JP.sdeckanji
locale.
Example 7-21: Building a Library of Methods Used with the ja_JP.sdeckanji Locale
cc -std0 -c \
    __mblen_sdeckanji.c __mbstopcs_sdeckanji.c \
    __mbstowcs_sdeckanji.c __mbtopc_sdeckanji.c \
    __mbtowc_sdeckanji.c __pcstombs_sdeckanji.c \
    __pctomb_sdeckanji.c __wcstombs_sdeckanji.c \
    __wcswidth_sdeckanji.c __wctomb_sdeckanji.c \
    __wcwidth_sdeckanji.c

ld -shared -set_version osf.1 -soname libsdeckanji.so -shared \
    -no_archive -o libsdeckanji.so \
    __mblen_sdeckanji.o __mbstopcs_sdeckanji.o \
    __mbstowcs_sdeckanji.o __mbtopc_sdeckanji.o \
    __mbtowc_sdeckanji.o __pcstombs_sdeckanji.o __pctomb_sdeckanji.o \
    __wcstombs_sdeckanji.o __wcswidth_sdeckanji.o __wctomb_sdeckanji.o \
    __wcwidth_sdeckanji.o \
    -lc
Refer to cc(1) and ld(1) for more information about the cc and ld commands and how you build shared libraries.
7.3.4 Creating a methods File for a Locale
The methods file contains an entry for each function that is defined in the methods shared library for use with the locale. Each entry begins with a method keyword that identifies the operation performed by the function, followed by quoted strings that give the name of the function and the path to the shared library that contains it.
Example 7-22 shows the section of a methods file for the methods used with the ja_JP.sdeckanji locale. Because there is a mandatory list of methods that you must define if you want to override any C library interfaces, your methods file must always specify an entry for each of the required methods, as shown in this example. The ja_JP.sdeckanji locale relies on default implementations for all optional methods, so Example 7-22 does not contain entries for any of the optional methods.
Example 7-22: The methods File for the ja_JP.sdeckanji Locale
# sdeckanji.m                                                [1]
# <method_keyword> "<entry>" "<package>" "<library_path>"    [1]
METHODS                                                      [2]
__mbstopcs "__mbstopcs_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
__mbtopc   "__mbtopc_sdeckanji"   "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
__pcstombs "__pcstombs_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
__pctomb   "__pctomb_sdeckanji"   "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
mblen      "__mblen_sdeckanji"    "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
mbstowcs   "__mbstowcs_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
mbtowc     "__mbtowc_sdeckanji"   "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
wcstombs   "__wcstombs_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
wcswidth   "__wcswidth_sdeckanji" "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
wctomb     "__wctomb_sdeckanji"   "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
wcwidth    "__wcwidth_sdeckanji"  "libsdeckanji.so" \
    "/usr/shlib/libsdeckanji.so"                             [3]
END METHODS                                                  [4]
[1] Comment lines. These lines specify the name of the methods file and the format of method entries. Note that the field identified in the format as <package> is ignored, but you must specify some string for this field in order to specify a library path.
[2] Header to mark the start of method entries.
[3] Entries for required methods.
[4] Trailer to mark the end of method entries.
Refer to localedef(1) for detailed information about methods file entries.
7.4 Building and Testing the Locale
Use the localedef command to build a locale from its source files. Example 7-23 shows the command line needed to build the French locale used in most examples in this chapter. Assume for this example that all source files reside in the user's default directory and that the resulting locale is also created in that directory.
Example 7-23: Building the fr_FR.ISO8859-1@example Locale
% localedef -f ISO8859-1.cmap \      [1]
    -i fr_FR.ISO8859-1.src \         [2]
    fr_FR.ISO8859-1@example          [3]
[1] The -f option specifies the character map source file.
[2] The -i option specifies the locale definition source file.
[3] The final argument to the command is the name of the locale.
When you are testing locales, particularly ones that are similar to standard locales installed on the system, you should add an extension to the locale name. Adding an at-sign (@) extension lets you keep the standard strings for language, territory, and codeset and still be sure that the test locale is uniquely identified. This is important if you later decide to move the locale to the /usr/lib/nls/loc directory, where other locales reside.
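The naming scheme described above can be made concrete with a short sketch that splits a locale name into its language, territory, codeset, and @ extension. The struct locale_parts type and the split_locale_name function are hypothetical names introduced only for this illustration; they are not part of the C library or of the localedef command, and the 32-byte field size is an arbitrary assumption.

```c
#include <string.h>

/* Hypothetical type for illustration: the parts of a locale name of the
 * form language_TERRITORY.codeset@extension. Absent parts are empty. */
struct locale_parts {
    char language[32];
    char territory[32];
    char codeset[32];
    char modifier[32];      /* the text after '@', for example "example" */
};

/* Copy len bytes from src into dst, truncating to fit, and terminate. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split name from right to left: first the '@' extension, then the
 * codeset after '.', then the territory after '_'. What remains is the
 * language. Each search is limited to the part not yet consumed. */
void split_locale_name(const char *name, struct locale_parts *out)
{
    const char *end = name + strlen(name);
    const char *at, *dot, *us;

    memset(out, 0, sizeof *out);

    at = strchr(name, '@');
    if (at != NULL) {
        copy_part(out->modifier, sizeof out->modifier,
                  at + 1, (size_t)(end - (at + 1)));
        end = at;
    }
    dot = memchr(name, '.', (size_t)(end - name));
    if (dot != NULL) {
        copy_part(out->codeset, sizeof out->codeset,
                  dot + 1, (size_t)(end - (dot + 1)));
        end = dot;
    }
    us = memchr(name, '_', (size_t)(end - name));
    if (us != NULL) {
        copy_part(out->territory, sizeof out->territory,
                  us + 1, (size_t)(end - (us + 1)));
        end = us;
    }
    copy_part(out->language, sizeof out->language,
              name, (size_t)(end - name));
}
```

For fr_FR.ISO8859-1@example, the first three fields match those of the standard fr_FR.ISO8859-1 locale, and only the modifier field distinguishes the test locale.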
Example 7-23 shows only one form and a few options for the localedef command. The localedef(1) reference page provides a complete description of the command. The following is a summary of some important rules and options:
If you defined methods for your locale, you must specify the methods file with the -m option. For example, the command line that builds the ja_JP.sdeckanji locale would include -m sdeckanji.m to identify the file shown in Example 7-22.
You can use the -v option to run the command in verbose mode for debugging purposes. This option, when used with the -c option, creates a .c file that contains useful information about the locale.
Use the -w option if you want the command to display warnings when it encounters duplicate definitions.
By default, locales must reside in the /usr/lib/nls/loc directory to be found. If you want to test your locale before moving it to the /usr/lib/nls/loc directory, you can define the LOCPATH variable to specify the directory where your locale is located. You can then define the LANG environment variable to be your new locale and interactively test the locale with commands and applications. Example 7-24 uses the date command to test the date/time format.
Example 7-24: Setting the LOCPATH Variable and Testing a Locale
% setenv LOCPATH ~harry/locales
% setenv LANG fr_FR.ISO8859-1@example
% date
ven 23 avr 13:43:05 EDT 1999
Note
The LOCPATH variable is an extension to the specifications in the X/Open UNIX standard and therefore might not be recognized on all systems that conform to this standard.
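The interactive check in Example 7-24 can also be made from a program. The following is a minimal sketch, not a definitive test tool, that loads the LC_TIME category of a named locale and formats a time value with the standard strftime function. The function name format_time_in_locale is hypothetical, and the choice of the %c conversion is an assumption made for this illustration; whether the locale is found in a directory named by LOCPATH depends on the system, as the preceding note explains.

```c
#include <locale.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper for illustration. Formats the time t with the
 * locale's preferred date and time representation ("%c") under the
 * locale named by locale_name. Pass "" to take the locale from the
 * environment. Returns the number of bytes written to buf, or 0 if the
 * locale cannot be loaded, the time cannot be converted, or the result
 * does not fit in buf. */
size_t format_time_in_locale(const char *locale_name, time_t t,
                             char *buf, size_t bufsize)
{
    struct tm *tm;

    /* setlocale returns NULL when the requested locale is not found. */
    if (setlocale(LC_TIME, locale_name) == NULL)
        return 0;

    tm = localtime(&t);
    if (tm == NULL)
        return 0;

    return strftime(buf, bufsize, "%c", tm);
}
```

Calling format_time_in_locale("", time(NULL), buf, sizeof buf) after setting LANG as in Example 7-24 exercises the d_t_fmt definition of the test locale. A return value of 0 suggests that the locale was not found or that the buffer is too small.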
Some programs have support files that are installed in system directories whose names exactly match the names of standard locales. In such cases, application software, system software, or both might use the value of the LANG environment variable to determine the locale-specific directory in which the support files reside. If assigned directly to the LANG or LC_ALL environment variable, a locale name with an at (@) suffix can result in invalid search paths for some applications. The following example shows how you can work around this problem by assigning the standard locale name to the LANG variable and the name of your variant locale to the locale category variables. You need to make assignments only to those category variables that represent areas where your locale differs from the locale on which it is based.
% setenv LANG fr_FR.ISO8859-1
% setenv LC_CTYPE fr_FR.ISO8859-1@example
% setenv LC_COLLATE fr_FR.ISO8859-1@example
   .
   .
   .
% setenv LC_TIME fr_FR.ISO8859-1@example
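This workaround relies on the order in which the locale environment variables are consulted: a nonempty LC_ALL overrides everything, otherwise the variable for the individual category applies, and LANG is used only as the fallback. The sketch below models that order. The function name effective_locale is hypothetical, and returning "C" when nothing is set is an assumption that stands in for the system's default locale.

```c
#include <stddef.h>

/* An environment variable counts as set only if it is non-NULL and
 * not the empty string. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Hypothetical helper for illustration. Given the values of LC_ALL, of
 * one category variable (for example LC_TIME), and of LANG, return the
 * locale name that is in effect for that category. */
const char *effective_locale(const char *lc_all,
                             const char *lc_category,
                             const char *lang)
{
    if (is_set(lc_all))
        return lc_all;          /* LC_ALL overrides every category */
    if (is_set(lc_category))
        return lc_category;     /* then the category's own variable */
    if (is_set(lang))
        return lang;            /* LANG is the fallback */
    return "C";                 /* assumption: stand-in for the default */
}
```

With the settings shown above and LC_ALL unset, effective_locale(getenv("LC_ALL"), getenv("LC_TIME"), getenv("LANG")) yields fr_FR.ISO8859-1@example, while an application that reads only LANG still sees the standard name fr_FR.ISO8859-1. The getenv function is declared in stdlib.h.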