The charmap(4) reference page explains the format and rules
for this file. This chapter includes a charmap example that conforms
to binary character encodings specified for the ISO Latin-1 codeset, which
defines all characters as single 8-bit bytes. The chapter also includes an
example that shows part of a charmap file for the SJIS codeset,
which defines both single-byte and multibyte characters.
The locale(4) reference page explains the rules and format for
this file. This chapter develops a locale named de_DE.ISO8859-1@example that supports the language and customs of Germany.
These files are required when the charmap file defines multibyte characters; otherwise, the files are optional. The methods file specifies the shareable library that contains redefinitions of the C Library interfaces that convert data to and from internal process (wide-character) encoding.
By default, the comment character
is the number sign (#). You can override this default with a <comment_char> definition (see 2).
This example provides entries for all valid declarations and specifies
default values for all but <code_set_name>. Usually, you specify
a declaration only when you want to override its default value. In this example, the declarations for <comment_char> and <escape_char> specify the default values for the
comment character and escape character, respectively. The value for <mb_cur_max>, the maximum length (in bytes) of a character, is 1
for this particular locale. The value for <mb_cur_min>, the
minimum length (in bytes) of a character, must be 1 in all locales. (All locales
include characters in the Portable Character Set, which defines single-byte
characters.)
The <code_set_name>
value will be the value returned on the nl_langinfo(CODESET) call
made by applications that bind to the locale at run time.
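As a minimal sketch of how an application might read this value back, the following function binds to a named locale and asks for its codeset. The function name codeset_of is hypothetical; setlocale() and nl_langinfo() are the standard interfaces.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Bind to the named locale and return its codeset name, or NULL if the
 * locale cannot be loaded. The returned string belongs to the C library
 * and may be overwritten by later calls. */
const char *codeset_of(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

Once the example locale is built and can be found, a call such as codeset_of("de_DE.ISO8859-1@example") would be expected to return the string given for <code_set_name>.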
Each character map
consists of a symbolic name and encoding. The name and encoding are separated
by one or more spaces.
A symbolic name begins with the left angle bracket (<) and ends with
the right angle bracket (>). The characters between the angle brackets can
be any characters from the Portable Character Set, except for control and
space characters. If the name includes more than one right angle bracket (>),
all but the last one must be preceded by the value of <escape_char>. A symbolic name cannot exceed 128 bytes in length.
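The naming rules above can be sketched as a small checker. This is an illustration, not the logic that localedef uses: the function name is hypothetical, the Portable Character Set is approximated as 7-bit ASCII, and the 128-byte limit is assumed to include the angle brackets.

```c
#include <ctype.h>
#include <string.h>

#define SYMBOL_MAX_BYTES 128   /* assumed to count the angle brackets */

/* Return 1 if name looks like a well-formed symbolic name, 0 otherwise.
 * esc is the escape character in effect (backslash by default). */
int is_valid_symbol(const char *name, char esc)
{
    size_t len = strlen(name);
    size_t i;

    if (len < 3 || len > SYMBOL_MAX_BYTES)
        return 0;
    if (name[0] != '<' || name[len - 1] != '>')
        return 0;

    for (i = 1; i < len - 1; i++) {
        unsigned char c = (unsigned char)name[i];

        if (c > 0x7f || iscntrl(c) || isspace(c))
            return 0;           /* outside the assumed character set */
        if (c == (unsigned char)esc && i + 1 < len - 1 && name[i + 1] == '>') {
            i++;                /* skip the escaped '>' */
            continue;
        }
        if (c == '>')
            return 0;           /* unescaped '>' before the final one */
    }
    return 1;
}
```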
An encoding can be
one or more decimal, octal, or hexadecimal constants. (Multiple constants
apply to multibyte encodings.) The constants have the following formats:
\dnnn or \dnn, where n is a decimal digit
\xnn, where n is a hexadecimal digit
\nnn or \nn, where n is an octal digit
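To make the three constant formats concrete, here is a hedged sketch of a parser for a single constant. The function name parse_constant is hypothetical, and the sketch does not enforce the two-or-three digit counts shown above; it only checks the prefix, the digits, and the 0 to 255 range.

```c
#include <ctype.h>
#include <stdlib.h>

/* Parse one constant such as \d008, \x08, or \010.
 * esc is the escape character in effect (backslash by default).
 * Returns the byte value (0..255), or -1 if the text is malformed. */
int parse_constant(const char *text, char esc)
{
    int base;
    char *end;
    long value;

    if (text[0] != esc)
        return -1;
    text++;

    if (*text == 'd') {         /* decimal constant */
        base = 10;
        text++;
    } else if (*text == 'x') {  /* hexadecimal constant */
        base = 16;
        text++;
    } else {                    /* octal constant */
        base = 8;
    }

    /* strtol() would accept signs and white space, so require a digit first. */
    if (!isxdigit((unsigned char)*text))
        return -1;

    value = strtol(text, &end, base);
    if (end == text || *end != '\0' || value < 0 || value > 255)
        return -1;
    return (int)value;
}
```

With the default escape character, the constants \d008, \x08, and \010 all yield the value 8.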
You can create multiple symbolic names for the same character (encoding).
In this source file, for example, the backspace character (value \d008) has two symbolic names, <BS> and <backspace>. When more than one symbolic name exists for a character, you can specify
any of them in locale definition source files to refer to the character.
The source files for codesets with multibyte characters have more complex
character maps. Example 7-2 shows a subset of character
map entries from a source file for the Japanese SJIS codeset. This source
file specifies entries from several character sets that must be supported
within the same codeset.
This value must be 1.
In SJIS, the largest multibyte character is 2 bytes in length.
Note how character symbols are specified as a range and how two hexadecimal
values determine the encoding for a 2-byte character.
When symbols are specified as a range of symbol values, the specified
character encoding applies to the first symbol in the range. The localedef command automatically increments both the symbol value and the encoding
value to create symbols and encodings for all characters in the range.
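The range expansion can be pictured with the following sketch, which formats the entry at a given offset from the start of a range. Everything here is an assumption made for illustration: the function name, the symbol shape (a prefix followed by a four-digit number), and the simplification that only the second byte of a 2-byte encoding is incremented, with no carry into the first byte.

```c
#include <stdio.h>

/* Format entry number k (0-based) of a range whose first symbol is
 * <prefixNNNN> and whose first 2-byte encoding is (lead, trail).
 * Assumes trail + k does not exceed 0xff. Returns the snprintf() result. */
int range_entry(char *out, size_t size, const char *prefix,
                unsigned first_number, unsigned lead, unsigned trail,
                unsigned k)
{
    return snprintf(out, size, "<%s%04u> \\x%02x\\x%02x",
                    prefix, first_number + k, lead, trail + k);
}
```

For a hypothetical range that starts at <j0101> with encoding \x81\x40, offset 1 produces the entry <j0102> \x81\x41.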
These maps establish ranges of encodings for which users can later define
characters.
Refer to the
The symbolic names for characters shown in this example are not necessarily
the names being proposed for adoption by any standards group.
The number sign
(#) is the default comment character. You can specify comments
as entire lines by entering the comment character in the first column of the
line. You cannot specify comments on the same lines as definition statements
in locale source files. In this respect, locale source files differ from character
map source files.
You can override the default comment character with an entry line that
begins with the comment_char keyword, followed by the symbol for
the desired character. The character symbol is defined in the character map
(charmap) source file for the locale.
The escape character, by default the backslash (\), is used
in decimal, hexadecimal, and octal constants and to indicate when definition
statements are continued to the next line of the source file. You can override
the default escape character with an entry line that begins with the escape_char keyword, followed by one or more blank characters, then
the symbol for the desired character. The character symbol is defined in the
character map source file for the locale.
Section headers correspond to category names, which are LC_CTYPE, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_MESSAGES, and LC_TIME.
The format of these statements varies from one category to the next.
In general, a statement begins with a keyword, followed by one or more spaces
or tabs, then the definition itself.
Section trailers start with the keyword END, followed by
the category name.
These definitions start with a keyword that stands for the character
class, followed by one or more blank characters, then a list of symbols for
all characters in that class. You can substitute the character's encoding
for its symbol; however, specifying characters by their encodings diminishes
the readability of the locale source file and makes it impossible to use the
file with more than one codeset.
As shown in the definition of the cntrl class, you can specify
a horizontal ellipsis (...) to represent a range of characters.
In the string <NUL>;...;<IS1>, for example, the ellipsis
represents all characters whose encodings are between the character whose
symbol is <NUL> and the character whose symbol is <IS1>. The symbols and their encodings are specified in the charmap file for the locale.
The standard character classes are represented by the following keywords:
From the application standpoint, there is also the class alnum. This class is not defined in a locale; it is by definition a combination
of characters in the alpha and digit classes.
These definitions, which begin with the keywords toupper
and tolower, list symbols in pairs rather than individually. In
the toupper definition shown here, the first symbol in the pair
is the symbol for a lowercase letter and the second symbol is the symbol for
that letter's uppercase equivalent. This definition determines what a letter
is converted to when functions perform case conversion on text data.
The preceding example does not completely illustrate all the options
you can use when defining the LC_CTYPE category. You can:
When you use a copy statement, it must be the only entry
between the section header and trailer.
Character classification is language specific. Therefore, the standard
character classes may not apply to all languages. Define for a locale only
the standard character classes that are appropriate for the locale's language.
Depending on the language, it may be necessary to define nonstandardized classes.
A definition for a nonstandardized character class must be preceded
by the charclass statement to define a keyword for the class, followed
by the class definition. For example:
Applications can use the wctype() and iswctype() functions to determine and test all character classes (including user-defined
ones). Applications can use the class-specific functions iswalpha, iswpunct, and so forth to test the standard character classes.
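A minimal sketch of the general-purpose test follows. The helper name in_class is hypothetical; wctype() and iswctype() are the standard functions named above.

```c
#include <wchar.h>
#include <wctype.h>

/* Return 1 if wc belongs to the named class in the current locale,
 * 0 if it does not or if the locale defines no class by that name. */
int in_class(wint_t wc, const char *class_name)
{
    wctype_t desc = wctype(class_name);

    if (desc == (wctype_t)0)
        return 0;               /* class name unknown in this locale */
    return iswctype(wc, desc) != 0;
}
```

The same call works for a class introduced with a charclass statement, provided the application is bound to a locale that defines it; a standard class name such as "alpha" works in any locale.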
Refer to the
Following the order_start keyword on the same line are sort
directives, separated by semicolons (;), that apply to each order. Sort directives
can include the following keywords.
When a sort directive includes two keywords, the position
keyword combined with either forward or backward, the
two keywords are separated by a comma (,). The position keyword
by itself is equivalent to the directive forward,position.
The number of sort directives corresponds to the number of weights each
collating element is assigned in subsequent statements.
Each sort directive and its associated set of weights specify information
for one pass, or level, of string comparison. The first directive applies
when the string comparison operation applies the primary weight, the second
when the string comparison operation applies the secondary weight, and so
on. The number of levels required to collate strings correctly depends on
language and cultural requirements and therefore varies from one locale to
another.
There is also a level
number maximum, associated with the COLL_WEIGHTS_MAX setting in
the limits.h and sys/localedef.h files. On Digital UNIX
systems, you are limited to six collation levels (sort directives).
The backward directive is used for many languages to ensure
that accented characters sort after unaccented characters only if the compared
strings are otherwise equivalent.
The position directive is frequently used to handle characters,
such as the hyphen (-) in Western European languages, whose significance can
be relative to word position. For example, assume you want the word "o-ring"
to collate in a word list before the word "or-ing", but you do not
want the hyphen to be considered until after strings are sorted by letters
alone. You would need two sort directives and associated sets of weight specifiers
to implement this order. For the first comparison operation, you specify forward as the sort directive, letters as the first weights for all
letter characters, and IGNORE as the weight for the hyphen character.
For the second, or a later, comparison operation, you specify forward,position
as the sort directive, IGNORE as the weight for
all letter characters, and the hyphen as the weight for the hyphen character.
If you do not specify a sort directive, the default is forward.
These statements specify a character symbol, followed by one or more
blank characters (spaces or tabs), then the symbols for characters that have
the same weight at each stage of the sort. For example, the lowercase character
o, lowercase character o umlaut, uppercase character O, and uppercase character
O umlaut, whose symbols are <o>, <o:>, <O>, and <O:>, respectively, are grouped together (have the
same weight) at the first sort level. At the secondary sort level, lowercase
o is grouped with lowercase o umlaut and uppercase O is grouped with uppercase
O umlaut. The four characters have distinct weights at the tertiary sort level.
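The level-by-level comparison described here can be sketched as follows. This is not how the C library implements strcoll(); it is a simplified model that assumes a forward directive at every level, and the weight values are hypothetical numbers chosen only to reproduce the grouping of o, O, and their umlauts (given as ISO Latin-1 bytes).

```c
#include <stddef.h>

#define LEVELS 3

/* Weights for one collating element at each level. */
struct weights {
    unsigned char ch;
    int w[LEVELS];
};

/* Hypothetical weights: equal at level 1, grouped by case at level 2,
 * all distinct at level 3. */
static const struct weights table[] = {
    { 'o',  { 10, 1, 1 } },
    { 0xf6, { 10, 1, 2 } },   /* o umlaut */
    { 'O',  { 10, 2, 3 } },
    { 0xd6, { 10, 2, 4 } },   /* O umlaut */
};

static int weight_of(unsigned char ch, int level)
{
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].ch == ch)
            return table[i].w[level];
    return 0;                 /* characters outside the table share weight 0 */
}

/* Compare a and b one level at a time, scanning forward at every level.
 * Stores the deciding level (1-based, or 0 if equal) in *level_out. */
int compare_levels(const char *a, const char *b, int *level_out)
{
    int level;

    for (level = 0; level < LEVELS; level++) {
        size_t i;

        for (i = 0; ; i++) {
            /* -1 marks the end of a string so that a prefix sorts first. */
            int wa = a[i] ? weight_of((unsigned char)a[i], level) : -1;
            int wb = b[i] ? weight_of((unsigned char)b[i], level) : -1;

            if (wa != wb) {
                *level_out = level + 1;
                return wa < wb ? -1 : 1;
            }
            if (wa == -1)
                break;        /* both strings ended; try the next level */
        }
    }
    *level_out = 0;
    return 0;
}
```

In this model, lowercase o and uppercase O are first told apart at level 2, while lowercase o and lowercase o umlaut are not told apart until level 3.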
The UNDEFINED keyword begins a collation order statement
to be applied to all characters that are defined in the locale's charmap file but not specified in other collation order statements. This statement
indicates that such characters are to be ignored during collation for all
weight comparisons.
You should include a collation order statement that begins with the
UNDEFINED keyword. If this statement is absent, the localedef command
includes undefined characters at the end of the collating order and issues
a warning.
Furthermore, if you place an UNDEFINED statement as the last collation
order statement, the localedef command can sometimes compress all
undefined characters into one entry. This action can reduce the size of the
locale.
The preceding example shows only a few of the options that you can specify
when defining the LC_COLLATE category. You can also use:
A copy statement can be the only entry between the section
header and trailer.
In such cases, you first specify collating-element statements
before the order_start statement to define symbols for the strings.
You can then specify those symbols in collating order statements. For example:
You must define each symbolic name by using the collating-symbol statement in the source file before the order_start statement.
You then include the symbol in the appropriate position in the list of collation
order statements for collating elements. For example, if you wanted the symbol <LOW> to represent the lowest position in the collating order, <LOW> would be the line entry immediately following the order_start statement. A symbol such as <UPPERCASE> would be positioned
on the line immediately preceding the section of collating order statements
for uppercase letters.
A symbol must occur before the first collation order statement in which
it is used. Therefore, you cannot define a symbol for the highest position
in the collating order.
After symbols are defined and positioned, you can use them as weights
in collating order statements. For example:
Refer to the
As an alternative to specifying symbol definitions, you can use the copy statement between the section header and trailer to duplicate an
existing locale's definition of LC_MESSAGES. The copy
statement represents a complete definition of the category and cannot be used
along with explicit symbol definitions.
The entries in the example specify the following:
The following list describes all the symbol names you can define in
the LC_MONETARY section:
The international currency symbol
The local currency symbol
The radix character, or decimal point, used in monetary formats
The character used to separate groups of digits to the left of the radix
character
The size of each group of digits to the left of the radix character
The string indicating that a monetary value is nonnegative
The string indicating that a monetary value is negative
The number of digits to be written to the right of the radix character
when int_curr_symbol appears in the format
The number of digits to be written to the right of the radix character
when currency_symbol appears in the format
An integer that determines if the international or local currency symbol
precedes a nonnegative value
An integer that determines whether a space separates the international
or local currency symbol from other parts of a formatted, nonnegative value
An integer that determines if the international or local currency symbol
precedes a negative value
An integer that determines whether a space separates the international
or local currency symbol from other parts of a formatted, negative value
An integer that indicates if or how the positive sign string is positioned
in a nonnegative, formatted value
An integer that indicates how the negative sign string is positioned
in a negative, formatted value
As an alternative to specifying symbol definitions, you can use the copy statement between the section header and trailer to duplicate an
existing locale's definition of LC_MONETARY. The copy
statement represents a complete definition of the category and cannot be used
along with explicit symbol definitions.
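At run time, an application can inspect the monetary settings of the locale it is bound to through the standard localeconv() function, whose struct lconv members carry the same information as the LC_MONETARY definitions. The following is a small sketch; the function name monetary_summary and the choice of fields to print are illustrative only.

```c
#include <locale.h>
#include <stdio.h>

/* Write a one-line summary of some monetary settings of the current
 * locale into out. Returns the snprintf() result. */
int monetary_summary(char *out, size_t size)
{
    const struct lconv *lc = localeconv();

    return snprintf(out, size,
                    "int_curr_symbol=\"%s\" currency_symbol=\"%s\" "
                    "mon_decimal_point=\"%s\" frac_digits=%d p_cs_precedes=%d",
                    lc->int_curr_symbol, lc->currency_symbol,
                    lc->mon_decimal_point,
                    (int)lc->frac_digits, (int)lc->p_cs_precedes);
}
```

In the C locale the string members are empty and the numeric members are CHAR_MAX, which indicates that no value is available; after a successful setlocale() call for a locale that defines LC_MONETARY, the summary reflects that locale's definitions.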
Refer to the
The preceding example shows all of the symbols you can define in the LC_NUMERIC section. In place of any symbol definitions, you can specify
a copy statement between the section header and trailer to include
this section from another locale.
Refer to the
Use the %a conversion specifier to include this string in
formats.
Use the %A conversion specifier to include this string in
formats.
Use the %b conversion specifier to include this string in
formats.
Use the %B conversion specifier to include this string in
formats.
Use this format to combine field descriptors (whose first character
is the percent sign (%)) and symbols for characters. You can specify
characters from the Portable Character Set (PCS), such as the period (.) and ASCII space, explicitly as characters rather than implicitly
through symbols; however, use symbols to specify all other characters.
The specified format includes the field descriptors for the day of the
month (%d), the full name of the month (%B), the full
representation of the year (%Y), the number of hours in a 24-hour
period (%H), the number of minutes (%M), and the number
of seconds (%S). If the date were December 12, 1993, and the time
29 seconds after 12 o'clock in the afternoon, the format specified in this
example would cause the date command to display 12.Dezember
1993 12:00:29.
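The effect of such a format can be checked from C with strftime(). The format string below is an assumed reconstruction from the field descriptors and the sample output described above, not a copy of the locale source, and the function name is hypothetical.

```c
#include <stddef.h>
#include <time.h>

/* Assumed reconstruction of the example's date and time format. */
#define EXAMPLE_D_T_FMT "%d.%B %Y %H:%M:%S"

/* Format the given broken-down time with the assumed format.
 * Returns the number of bytes written, or 0 if out is too small. */
size_t format_example(char *out, size_t size, const struct tm *when)
{
    return strftime(out, size, EXAMPLE_D_T_FMT, when);
}
```

The month name comes from the LC_TIME category of the current locale, so the same call yields an English month name in the C locale and would yield the German name only after the application binds to the German locale with setlocale().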
The preceding example includes only some of the symbol definitions that
are standard for the LC_TIME category. The following definitions
are also standard:
Format for the date alone; corresponds to the %x field descriptor
Format for the time alone; corresponds to the %X field descriptor
Format for the ante meridiem and post meridiem time strings; corresponds
to the %p field descriptor
For example, the definition for English would be:
Format for the time according to the 12-hour clock; corresponds to the %r field descriptor
Definition of how years are counted and displayed for each era (an
Asian date construct) in the locale
Format of the date alone in era notation; corresponds to the %Ex field descriptor
Format of the time alone in era notation; corresponds to the %EX field descriptor
Format of both date and time in era notation; corresponds to the %Ec field descriptor
Definition of alternative symbols for digits (used in Asian locales);
corresponds to the %O field descriptor
As is true for other category sections, you can specify a copy statement to include all LC_TIME definitions from another
locale. Note that Digital UNIX supports symbols and field descriptors
in addition to those described here. Refer to the
Only locales with multibyte codesets must use methods. When a locale
uses methods, there are some methods that the locale must supply and other
methods that it can optionally supply. A method is required when the corresponding
interface is converting characters between data formats and needs codeset-specific
logic to do that operation correctly. A method is optional when the corresponding
interface is working with data after it has been converted to wide-character
format and can apply logic that is valid for both single-byte and multibyte
characters.
Methods must be available on the system in a shareable library. This
library and the functions that implement each method in the library are made
known to the localedef command through a methods file.
When the localedef command processes the methods file
along with the charmap and locale source files, the
resulting locale includes pointers to all methods that are supplied with the
locale, along with pointers to default implementations for optional methods
that are not supplied with the locale. When you set the LANG variable
to the newly built locale and run a command or application, methods are used
wherever they have been enabled in the system software.
This method is similar to the one for mbstowcs (see Section 7.3.1.6) but contains additional parameters to meet
the needs of fgetws(). By convention, a C source file for
this method has the file name __mbstopcs_codeset.c, where codeset identifies the codeset for which the method
is tailored. Example 7-10 shows the file __mbstopcs_sdeckanji.c that defines the __mbstopcs method used with the ja_JP.sdeckanji locale.
This parameter is needed because the fgetws() function
reads from the standard I/O buffer, which does not contain null-terminated
strings.
This value, typically \n, is passed to the method on the
call from the fgetws() function, which handles only one
line of input per call.
This pointer is needed to specify the starting character in the standard
I/O buffer for the next call to fgetws().
The localedef command creates and stores values in the _LC_charmap_t structure.
The err variable contains the return status of the call to
the mbtopc method:
In this case, the return is the number of bytes required to form a valid
character. The fgetws() function can then refill the buffer
and try again.
The codeset supports several character sets and each set contains characters
of only one length. The value in the first byte indicates the character set
and therefore the character length. For character sets with multibyte characters,
one or more additional bytes must be examined to determine whether the value
sequence identifies a character or is invalid.
This value is passed by the calling function.
This operation prevents problems when integer values are stored in the
array and then referenced by index. Compilers apply sign extension to values
when comparing a small signed data type, such as char, to a large
signed data type, such as int. Sign extension means that the high
bit of the value in the small data type is used to fill in bits that remain
when the value is converted to the larger data type for comparison. For example,
if s[0] is the value 0x8e, sign extension would cause it to be
treated as 0xffffff8e. In this case, a condition like the following one would
be evaluated as true when you would expect it to be false:
if (s[0] <= 0x8d)
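The effect is easy to reproduce in isolation. The two hypothetical helper functions below perform the same comparison, once through a signed byte and once through an unsigned byte; the sketch assumes the usual 8-bit, two's-complement char.

```c
/* Compare a byte against 0x8d through a signed type. The byte is
 * promoted to int with sign extension, so 0x8e becomes a negative value. */
int le_8d_signed(signed char byte)
{
    return byte <= 0x8d;
}

/* Compare a byte against 0x8d through an unsigned type. The byte is
 * promoted to int without sign extension, so 0x8e stays 142. */
int le_8d_unsigned(unsigned char byte)
{
    return byte <= 0x8d;
}
```

For the byte 0x8e, the signed version reports true and the unsigned version reports false, which is why methods examine multibyte data through an unsigned type.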
This operation ensures that *pwc always points to a valid
address; otherwise, an application could produce a segmentation fault by referring
to this pointer when a wide character has not been stored in pwc.
If s contains no characters, returns zero (0) to indicate
that no bytes were converted and sets err to 1 to indicate that
1 byte is needed to form a valid character.
If the byte value is in the range being tested, moves the associated
process code value to pwc and returns 1 to indicate the number
of bytes converted.
If yes, moves the associated process code value to the pwc
buffer and returns 2 to indicate the number of bytes converted; otherwise,
returns 0 to indicate that no conversion took place and sets err
to 2 to specify that at least 2 bytes are needed to form a valid character.
If yes, moves the associated process code value to pwc and
returns 3 to indicate the number of bytes converted; otherwise, sets err to 3 to indicate that at least 3 bytes are needed and returns zero
(0) to indicate that no character was converted.
If there are no bytes in the standard I/O buffer, returns zero (0) to
indicate that no bytes were converted and sets err to 1 to indicate
that at least 1 byte is needed to form a valid character.
If the byte value is in the defined range, moves the associated process
code value to pwc and returns 1 to indicate the number of bytes
converted.
If yes, moves the associated process code value to the pwc buffer
and returns 2 to indicate the number of bytes converted; otherwise, sets err to 2 to indicate that at least 2 bytes are needed to form a valid
character and returns zero (0) to indicate that no bytes were converted.
These statements execute if the multibyte data in s satisfies
none of the preceding if conditions.
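The return-value and err conventions described in these notes can be summarized in a simplified sketch. This is not the method source from the examples: the function name, the parameter list, the byte ranges of the 2-byte codeset, the process code values, and the use of -1 to report an invalid sequence are all assumptions made for illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical codeset: 0x00-0x7f are single bytes; a byte in 0xa1-0xfe
 * starts a 2-byte character whose second byte is also in 0xa1-0xfe.
 * Returns the number of bytes converted, 0 if more bytes are needed
 * (with *err set to the number required), or -1 if the data is invalid. */
int sketch_mbtopc(wchar_t *pwc, const char *src, size_t len, int *err)
{
    const unsigned char *s = (const unsigned char *)src; /* no sign extension */
    wchar_t dummy;

    if (pwc == NULL)
        pwc = &dummy;           /* always have a valid place to store */
    *err = 0;

    if (len < 1) {
        *err = 1;               /* at least 1 byte is needed */
        return 0;
    }
    if (s[0] <= 0x7f) {
        *pwc = (wchar_t)s[0];
        return 1;
    }
    if (s[0] >= 0xa1 && s[0] <= 0xfe) {
        if (len < 2) {
            *err = 2;           /* at least 2 bytes are needed */
            return 0;
        }
        if (s[1] >= 0xa1 && s[1] <= 0xfe) {
            /* Illustrative process code: both bytes packed together. */
            *pwc = (wchar_t)(((unsigned)s[0] << 8) | s[1]);
            return 2;
        }
    }
    *err = -1;                  /* assumed marker for an invalid sequence */
    return -1;
}
```

A caller that receives 0 can read the value in err, refill its buffer with at least that many bytes, and call the function again.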
This return causes the fputws() function to use multiple
calls to putwc() to convert wide characters in the string.
If you choose to implement this method fully rather than writing it
to return -1, your function implementation returns the number of wide
characters converted and must include header files and parameters as shown
in the following example:
This value is passed to the method on the call from fputws().
This value is passed to the method on the call from fputws().
If this method calls the wctomb method to perform the character
conversion, the wctomb method sets this status. Otherwise, this
method must incorporate the logic to perform wide-character to multibyte-character
conversion and set the status directly.
In any event, the fputws() function expects the following
values:
In this case, the value is the number of bytes required to store the
next character. The fputws() function can then empty the
multibyte-character buffer and try again.
The __pcstombs method performs the reverse of the
operation that the __mbstopcs method described in Section 7.3.1.3 performs. Because of the direction of the
data conversion, the __pcstombs method:
The codeset supports several character sets and each set contains characters
of only one length. The value in the first byte indicates the character set
and therefore the character length. For character sets with multibyte characters,
one or more additional bytes must be examined to determine whether the value
sequence identifies a character or is invalid.
This value is passed to the method by the mblen()
function.
This operation prevents problems when integer values are stored in the
array and then referenced by index. Compilers apply sign extension to values
when comparing a small signed data type, such as char, to a large
signed data type, such as int. Sign extension means that the high
bit of the value in the small data type is used to fill in bits that remain
when the value is converted to the larger data type for comparison. For example,
if s[0] is the value 0x8e, sign extension would cause it to be
treated as 0xffffff8e. In this case, a condition like the following one would
be evaluated as true when you would expect it to be false:
if (s[0] <= 0x8d)
To set errno in a way that works correctly with multithreaded applications,
use _Seterrno rather than an assignment statement.
If yes, returns 1 to indicate that the character length is 1 byte.
If yes, returns 2 to indicate that the character length is 2 bytes.
If yes, returns 3 to indicate that the character length is 3 bytes.
If yes, returns 1 to indicate that the character length is 1 byte.
If yes, returns 2 to indicate that the character length is 2 bytes.
These statements execute if the multibyte data in the standard I/O buffer
satisfies none of the preceding if conditions.
The programmer can request the size of the pwcs buffer (for
memory allocation purposes) by passing a null pointer as the pwcs parameter in the call to mbstowcs(). The programmer
can then use the return value to efficiently allocate memory space for the
application's wide-character buffer before calling mbstowcs() again to actually convert the multibyte string.
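From the application side, the two-call pattern might look like the following sketch. The function name to_wide is hypothetical; the size query with a null destination pointer is the behavior that POSIX specifies for mbstowcs().

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string to a newly allocated wide-character string.
 * Returns NULL on an invalid sequence or allocation failure; the caller
 * frees the result. */
wchar_t *to_wide(const char *s)
{
    size_t count = mbstowcs(NULL, s, 0);   /* size query: nothing is stored */
    wchar_t *buf;

    if (count == (size_t)-1)
        return NULL;

    buf = malloc((count + 1) * sizeof *buf);   /* +1 for the terminator */
    if (buf == NULL)
        return NULL;

    if (mbstowcs(buf, s, count + 1) == (size_t)-1) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

The count returned by the first call excludes the terminating null wide character, which is why the sketch allocates one extra element.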
Stops processing and returns the number of wide characters in the pwcs buffer if a null byte is encountered; increments the byte position in
the multibyte character buffer by an appropriate number each time a character
is successfully converted.
This while loop uses the condition len-- > 0 to ensure that processing stops when the pwcs buffer is
full. The first if condition in the loop makes sure that, if the
multibyte string in the s buffer is null terminated, the associated
null terminator in the pwcs buffer is not included in the wide-character
count that the mbstowcs() function returns to the application.
This statement executes if the pwcs buffer runs out of space
before a null byte is encountered in the s buffer.
The codeset supports several character sets and each set contains characters
of only one length. The value in the first byte indicates the character set
and therefore the character length. For character sets with multibyte characters,
one or more additional bytes must be examined to determine whether the value
sequence identifies a character or is invalid.
This value is passed from the calling function; the value will have
been set to MB_CUR_MAX on the original call made by the application
programmer.
This operation prevents problems when integer values are stored in the
array and then referenced by index. Compilers apply sign extension to values
when comparing a small signed data type, such as char, to a large
signed data type, such as int. Sign extension means that the high
bit of the value in the small data type is used to fill in bits that remain
when the value is converted to the larger data type for comparison. For example,
if s[0] is the value 0x8e, sign extension would cause it to be
treated as 0xffffff8e. In this case, a condition like the following one would
be evaluated as true when you would expect it to be false:
if (s[0] <= 0x8d)
If passed a null pointer, this method should return a value to indicate
whether the locale's character encoding is stateful or stateless. Return
a nonzero value if your locale's character encoding is stateful.
This operation ensures that pwc always points to a valid
address; otherwise, an application could produce a segmentation fault by referring
to this pointer when a wide character has not been stored in pwc.
If yes, stores the associated process code value in the pwc
buffer and returns 1 to indicate that the character length is 1 byte.
If yes, stores the associated process code value in the pwc
buffer and returns 2 to indicate that the character length is 2 bytes.
If yes, stores the associated process code value in the pwc
buffer and returns 3 to indicate that the character length is 3 bytes.
If yes, stores the associated process code value in the pwc
buffer and returns 1 to indicate that the character length is 1 byte.
If yes, stores the associated process code value in the pwc
buffer and returns 2 to indicate that the character length is 2 bytes.
These statements execute if the multibyte data in the s buffer
satisfies none of the preceding if conditions.
This value is supplied by the calling function.
If yes, calls the wctomb method to calculate the number of
bytes required for converted characters (excluding the null terminator) in
the multibyte-character buffer.
The programmer can request the size of the s buffer (for
memory allocation purposes) by passing a null pointer as the s parameter on the call to wcstombs(). The programmer
can then use the return value to efficiently allocate memory space for the
application's multibyte-character buffer before calling wcstombs() again to actually convert the wide-character string.
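The corresponding application-side pattern for the reverse conversion is sketched below. The function name to_multibyte is hypothetical; the size query with a null destination pointer is the behavior that POSIX specifies for wcstombs().

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a wide-character string to a newly allocated multibyte string.
 * Returns NULL if a character cannot be converted or allocation fails;
 * the caller frees the result. */
char *to_multibyte(const wchar_t *ws)
{
    size_t bytes = wcstombs(NULL, ws, 0);   /* size query: nothing is stored */
    char *buf;

    if (bytes == (size_t)-1)
        return NULL;

    buf = malloc(bytes + 1);                /* +1 for the null terminator */
    if (buf == NULL)
        return NULL;

    if (wcstombs(buf, ws, bytes + 1) == (size_t)-1) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

The byte count returned by the size query excludes the null terminator, matching the note above about the wctomb-based calculation.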
Each character set supported by the codeset corresponds to a unique
range of wide-character (process code) values and, within each character set,
multibyte characters are of uniform length (1, 2, or 3 bytes). Therefore,
the range in which each wide-character value falls indicates the number of
bytes required for the character in multibyte format; the wide-character value
itself determines the specific byte value or values for the character in multibyte
format.
These statements execute if the wide-character value satisfies none
of the preceding conditions.
Note that each character's display width is either 1 or 2 columns, depending
on the character set to which a character belongs. Display width is different
from the size of the character in multibyte format; for example, triple-byte
characters require 2 display columns and double-byte characters can require
either 1 or 2 display columns.
This statement executes if a value that satisfies none of the preceding
conditions is encountered in the string. The calling function, wcswidth(), also returns -1 if the wide character is nonprintable; however,
this condition is evaluated at the level of the calling function and does
not need to be evaluated by the method.
Note that a character's display width is either 1 or 2 columns, depending
on the character set to which a character belongs. Display width is different
from the size of the character in multibyte format; for example, triple-byte
characters require 2 display columns and double-byte characters can require
either 1 or 2 display columns.
The calling function, wcwidth(), also returns -1
if the wide character is nonprintable; however, this condition is evaluated
at the level of the calling function and does not need to be evaluated by
the method.
Writing optional methods requires detailed information about the internal
interfaces to C library routines. This information is proprietary to Digital
and may be subject to change. In the rare cases where your locale must include
an optional method, contact your Digital technical support representative
to request information.
Example 7-21 shows the compiler and linker command
lines that are required to build the method source files into a shareable
library that is used with the ja_JP.sdeckanji locale.
Example 7-22 shows the section of a methods file for the methods used with the ja_JP.sdeckanji locale.
Because there is a mandatory list of methods that you must define if you
want to override any C library interfaces, your methods file must
always specify an entry for each of the required methods as shown in this
example. The ja_JP.sdeckanji locale relies on default implementations
for all optional methods, so Example 7-22 does not contain
entries for any of the optional methods.
These lines specify
the name of the methods file and the format of method entries. Note that the
field identified in the format as <package> is ignored, but
you must specify some string for this field in order to specify a library
path.
When you are testing
locales, particularly ones that are similar to standard locales installed
on the system, you should add an extension to the locale name. Varying names
with the at (@) extension allows you to specify the standard strings
for language, territory, and codeset and still be sure that the test locale
is uniquely identified. This is important if you later decide to move the
locale to the directory /usr/lib/nls/loc where other locales reside.
Example 7-23 shows only one form and a few options
for the localedef command.
By default, locales must reside in the /usr/lib/nls/loc directory to be found. If you want to test your locale
before moving it to the /usr/lib/nls/loc directory, you can define
the LOCPATH variable to specify the directory where your locale
is located. You can then define the LANG environment variable
to be your new locale and interactively test the locale with commands and
applications.
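Before testing interactively, it can be useful to confirm that the C library is able to load the new locale at all. The following is a minimal sketch under stated assumptions, not part of the original example set: the helper name locale_is_loadable() is hypothetical, and the code relies only on the standard setlocale() function, which returns NULL when the requested locale cannot be found.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if setlocale() can load the named
 * locale for all categories, 0 otherwise. The locale that was in
 * effect before the call is restored before returning. */
int locale_is_loadable(const char *name)
{
    const char *current = setlocale(LC_ALL, NULL);
    char *saved = NULL;
    int ok;

    /* Copy the current setting; a later setlocale() call may
     * overwrite the string that "current" points to. */
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    ok = (setlocale(LC_ALL, name) != NULL);

    if (saved != NULL) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
    return ok;
}
```

With LOCPATH pointing at the directory that holds the test locale, a call such as locale_is_loadable("de_DE.ISO8859-1@example") would be expected to return 1 once the locale has been built.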
Example 7-24 uses the date command to
test the date/time format.
Some programs have support files that are installed in system directories
with names that exactly match the names of standard locales. In such cases,
application software, system software, or both might use the value of the LANG environment variable to determine the locale-specific directory
in which the support files reside. If assigned directly to the LANG
or LC_ALL environment variable, locale file names with an at (@)
suffix may result in invalid search paths for some applications. The following
example shows how you can work around this problem by assigning the standard
locale name to the LANG variable and the name of your variant locale
to the locale category variables. You need to make assignments only to those
category variables that represent areas where your locale differs from the
locale on which it is based.
7.1 Creating a Character Map Source File for a Locale
A charmap file defines symbols
for character binary encodings. The localedef command uses this
file to map character symbols in a locale source file to the character encodings. Example 7-1 shows a fragment of the source file, ISO8859-1.cmap, used for the de_DE.ISO8859-1@example locale being developed
in this chapter. Appendix B contains this file in its entirety.
Example 7-1: The charmap File for a Sample Locale
# Map file providing symbols for characters whose binary (1)
# encodings are specified in the ISO Latin-1 codeset. (1)
<code_set_name> "ISO8859-1" (2)
<mb_cur_max> 1 (2)
<mb_cur_min> 1 (2)
<escape_char> \ (2)
<comment_char> # (2)
CHARMAP (3)
<NU> \d000 (4)
<SH> \d001
<SX> \d002
<EX> \d003
<ET> \d004
<EQ> \d005
<AK> \d006
<BL> \d007
<BS> \d008
.
.
.
<0> \d048 (4)
<1> \d049
<2> \d050
<3> \d051
.
.
.
<A> \d065 (4)
<B> \d066
<C> \d067
<D> \d068
<E> \d069
.
.
.
<X> \d088 (4)
<Y> \d089
<Z> \d090
<<(> \d091
<//> \d092
<)\>> \d093
<'\>> \d094
<_> \d095
<'!> \d096
<a> \d097
<b> \d098
<c> \d099
<d> \d100
<e> \d101
.
.
.
<x> \d120 (4)
<y> \d121
<z> \d122
<(!> \d123
<!!> \d124
<!)> \d125
<'?> \d126
<DT> \d127
.
.
.
<O:> \d214 (4)
<U:> \d220
.
.
.
<ss> \d223 (4)
.
.
.
<o:> \d246 (4)
.
.
.
<u:> \d252 (4)
.
.
.
<backspace> \d008 (5)
<tab> \d009
<newline> \d010
<vertical-tab> \d011
<form-feed> \d012
<carriage-return> \d013
.
.
.
<space> \d032 (5)
<exclamation-mark> \d033
<quotation-mark> \d034
<number-sign> \d035
<dollar-sign> \d036
END CHARMAP (6)
Example 7-2: Fragment from a charmap File for a Multibyte Codeset
# SJIS charmap
#
<code_set_name> "SJIS" (1)
<mb_cur_min> 1 (2)
<mb_cur_max> 2 (3)
CHARMAP
#
# CS0: ASCII
#
.
.
.
<commercial-at> \x40 (4)
<A> \x41 (4)
<B> \x42 (4)
.
.
.
#
# CS1: JIS X0208-1983 for ShiftJIS.
#
<zenkaku-space> \x81\x40 (5)
<j0101>...<j0163> \x81\x40 (5)
<j0164>...<j0194> \x81\x80 (5)
.
.
.
#
# UDC Area in JIS X0208 plane
#
<u8501>...<u8563> \xeb\x40 (6)
<u8564>...<u8594> \xeb\x80 (6)
<u8601>...<u8663> \xeb\x9f (6)
.
.
.
#
# CS2: JIS X0201 (so-called Hankaku-Kana)
#
<kana-fullstop> \xa1 (7)
.
.
.
<kana-conjunctive> \xa5 (7)
<kana-WO> \xa6 (7)
<kana-a> \xa7 (7)
.
.
.
END CHARMAP
Refer to the charmap(4) reference page for a complete list of rules that
apply to character map source files.
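At run time, an application can see two of the charmap declarations through standard interfaces: the <code_set_name> value is what nl_langinfo(CODESET) returns, and the <mb_cur_max> value is reflected in the MB_CUR_MAX macro. The following sketch illustrates this; the function name report_codeset() is hypothetical and the output format is only an illustration.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: binds to the named locale and prints the
 * codeset name and the maximum character length in bytes.
 * Returns 0 on success, -1 if the locale cannot be loaded. */
int report_codeset(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    /* Corresponds to <code_set_name> in the charmap file. */
    printf("codeset:    %s\n", nl_langinfo(CODESET));

    /* Corresponds to <mb_cur_max> in the charmap file. */
    printf("MB_CUR_MAX: %lu\n", (unsigned long) MB_CUR_MAX);
    return 0;
}
```

For a locale built from the charmap in Example 7-1, the second line would be expected to report 1; for a locale built from the SJIS fragment in Example 7-2, it would be expected to report 2.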
7.2 Creating Locale Definition Source Files
A locale definition
source file defines data that is specific to a particular language and territory.
The source file is organized into sections, one for each category of locale
data being defined. Example 7-3 shows the structure
of a locale definition source file in pseudocode. The sections for locale
categories are discussed in more detail following the example.
Example 7-3: Structure of Locale Source Definition File
# comment-line (1)
comment_char <char_symbol1> (2)
escape_char <char_symbol2> (3)
CATEGORY_NAME (4)
category_definition-statement (5)
category_definition-statement (5)
.
.
.
END CATEGORY_NAME (6)
.
.
.
(7)
7.2.1 Defining the LC_CTYPE Locale Category
The LC_CTYPE section defines character classes and character attributes used in
operations such as case conversion. Example 7-4 shows the
definition for this section.
Example 7-4: LC_CTYPE Category Definition
LC_CTYPE (1)
upper <A>;<A:>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;<N>;<O>;\
<O:>;<P>;<Q>;<R>;<S>;<T>;<U>;<U:>;<V>;<W>;<X>;<Y>;<Z> (2)
lower <a>;<a:>;<b>;<c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;<n>;<o>;\
<o:>;<p>;<q>;<r>;<s>;<ss>;<t>;<u>;<u:>;<v>;<w>;<x>;<y>;<z> (2)
alpha <A>;<A:>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;<N>;<O>;\
<O:>;<P>;<Q>;<R>;<S>;<T>;<U>;<U:>;<V>;<W>;<X>;<Y>;<Z>;<a>;<a:>;<b>;\
<c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;<n>;<o>;<o:>;<p>;<q>;<r>;\
<s>;<ss>;<t>;<u>;<u:>;<v>;<w>;<x>;<y>;<z> (2)
space <tab>;<newline>;<vertical-tab>;<form-feed>;<carriage-return>;<space>;\
<NS> (2)
cntrl <NUL>;...;<IS1>;<DEL>;...;<AC> (2)
.
.
.
toupper (<a>,<A>);(<a:>,<A:>);(<b>,<B>);(<c>,<C>);(<d>,<D>);(<e>,<E>);\
(<f>,<F>);(<g>,<G>);(<h>,<H>);(<i>,<I>);(<j>,<J>);(<k>,<K>);\
(<l>,<L>);(<m>,<M>);(<n>,<N>);(<o>,<O>);(<o:>,<O:>);(<p>,<P>);\
(<q>,<Q>);(<r>,<R>);(<s>,<S>);(<t>,<T>);(<u>,<U>);(<u:>,<U:>);\
(<v>,<V>);(<w>,<W>);(<x>,<X>);(<y>,<Y>);(<z>,<Z>) (3)
.
.
.
END LC_CTYPE (4)
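The toupper pairs defined in the LC_CTYPE category are what the case-conversion functions consult at run time. As a minimal sketch (the function name upcase_wide() is hypothetical), an application might convert a wide-character string as follows, relying only on the standard towupper() function:

```c
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: converts a wide-character string to uppercase
 * in place, using the toupper mappings of the LC_CTYPE category that
 * is currently in effect. Characters without a mapping are unchanged. */
void upcase_wide(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t) towupper((wint_t) *ws);
}
```

The application must first call setlocale() so that the mappings of the intended locale are in effect. With the sample German locale, the pair (<o:>,<O:>) in the toupper definition is what would allow a lowercase o-umlaut to be converted.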
charclass vowel
vowel <a>;<e>;<i>;<o>;<u>;<y>
Refer to the locale(4) reference page for additional rules and restrictions
that apply to the LC_CTYPE category definition.
7.2.2 Defining the LC_COLLATE Locale Category
The LC_COLLATE section specifies how characters and strings are collated. Example 7-5 shows part of an LC_COLLATE section.
Example 7-5: LC_COLLATE Category Definition
LC_COLLATE (1)
order_start forward;forward;backward (2)
.
.
.
<o> <o>;<o>;<o> (3)
.
.
.
<o:> <o>;<o>;<o:> (3)
.
.
.
<O> <o>;<O>;<O> (3)
.
.
.
<O:> <o>;<O>;<O:> (3)
.
.
.
<Z> <z>;<Z>;<Z> (3)
.
.
.
UNDEFINED IGNORE;IGNORE;IGNORE (4)
order_end (5)
END LC_COLLATE (6)
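Applications reach the collation rules through functions such as strcoll(), which compares two strings according to the LC_COLLATE category in effect. The following sketch sorts an array of strings with qsort() and strcoll(); the names compare_collated() and sort_collated() are hypothetical.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparison callback: each argument points to an element of
 * the array, that is, to a "const char *". */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    /* strcoll() applies the collation weights of the current locale. */
    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sorts "count" strings in locale order. */
void sort_collated(const char **words, size_t count)
{
    qsort(words, count, sizeof words[0], compare_collated);
}
```

In the C locale the result is plain byte order. Under a locale whose LC_COLLATE section resembles Example 7-5, words that differ only by an umlaut would be expected to sort next to each other, because <o> and <o:> share the same first-level weight.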
collating-element <ch> from "<c><h>"
.
.
.
order_start forward;forward;backward
.
.
.
<ch> <Ch>;<ch>;<ch>
.
.
.
collating-symbol <LOWERCASE>
collating-symbol <UNACCENTED>
.
.
.
order_start forward;backward;forward;forward
.
.
.
<UNACCENTED>
.
.
.
<LOWERCASE>
<a> <a>;<UNACCENTED>;<LOWERCASE>;IGNORE
.
.
.
Refer to the locale(4) reference page for more detailed information
on the LC_COLLATE category definition.
7.2.3 Defining the LC_MESSAGES Locale Category
The LC_MESSAGES section defines strings that are valid for affirmative and negative
responses from users. Example 7-6 shows an LC_MESSAGES section.
Example 7-6: LC_MESSAGES Category Definition
LC_MESSAGES (1)
yesexpr "^[<j><J>][[:alpha:]]*" (2)
noexpr "^[<n><N>][[:alpha:]]*" (3)
yesstr "<j>" (4)
nostr "<n>" (5)
END LC_MESSAGES (6)
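An application typically retrieves the yesexpr value with nl_langinfo(YESEXPR) and matches user input against it as an extended regular expression. The following is a sketch of that pattern, not a definitive implementation; the function name is_affirmative() is hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if "response" matches the yesexpr
 * of the current locale, 0 if it does not, and -1 if the expression
 * cannot be compiled. */
int is_affirmative(const char *response)
{
    regex_t re;
    int matched;

    if (regcomp(&re, nl_langinfo(YESEXPR), REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    matched = (regexec(&re, response, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

With the definition in Example 7-6, a response such as "ja" would be expected to match, because the expression accepts any alphabetic string that begins with j or J.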
7.2.4 Defining the LC_MONETARY Locale Category
The LC_MONETARY section of the locale source file defines the rules and
symbols used to format monetary values. Application developers use the localeconv() and nl_langinfo() functions
to determine the information defined in this section and apply formatting
rules through the strfmon() function. Example 7-7
shows an LC_MONETARY section.
Example 7-7: LC_MONETARY Category Definition
LC_MONETARY (1)
int_curr_symbol "<D><M>" (2)
currency_symbol "<D><M>" (2)
mon_decimal_point "<,>" (2)
mon_thousands_sep "<.>" (2)
mon_grouping 3 (2)
positive_sign "" (2)
negative_sign "<->" (2)
.
.
.
END LC_MONETARY (3)
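The following sketch shows how an application might read two of these values through localeconv(). It is deliberately simplified and is an assumption-laden illustration: the function name format_amount() is hypothetical, the amount is passed as whole units plus hundredths, and grouping, sign placement, and symbol position are ignored. Production code would normally use strfmon() instead.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: writes "units<point>hundredths [symbol]" into
 * buf, taking the decimal point and currency symbol from the
 * LC_MONETARY category in effect. Returns the snprintf() result. */
int format_amount(char *buf, size_t size, long units, int hundredths)
{
    const struct lconv *lc = localeconv();
    const char *point = lc->mon_decimal_point;
    const char *symbol = lc->currency_symbol;

    /* The C locale defines an empty monetary decimal point;
     * fall back to a period in that case. */
    if (*point == '\0')
        point = ".";

    if (*symbol == '\0')
        return snprintf(buf, size, "%ld%s%02d", units, point, hundredths);

    return snprintf(buf, size, "%ld%s%02d %s",
                    units, point, hundredths, symbol);
}
```

Under the definitions in Example 7-7, this simplified layout would be expected to render 12 units and 50 hundredths as "12,50 DM".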
Refer to the locale(4) reference page for complete information about
specifying LC_MONETARY symbol definitions.
7.2.5 Defining the LC_NUMERIC Locale Category
The LC_NUMERIC section of the locale source file defines the
rules and symbols used to format numeric data. You can use the localeconv() and nl_langinfo() functions to access this formatting
information. Example 7-8 shows this section.
Example 7-8: LC_NUMERIC Category Definition
LC_NUMERIC (1)
decimal_point "<,>" (2)
thousands_sep "<.>" (3)
grouping 3 (4)
END LC_NUMERIC (5)
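The decimal_point value also affects standard formatted output: the printf() family takes its radix character from the LC_NUMERIC category in effect. The sketch below shows both routes to the value; the function names format_number() and current_decimal_point() are hypothetical.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: formats a value with two decimal places.
 * snprintf() inserts the radix character of the current locale. */
int format_number(char *buf, size_t size, double value)
{
    return snprintf(buf, size, "%.2f", value);
}

/* Hypothetical helper: returns the decimal_point string of the
 * current locale as reported by localeconv(). */
const char *current_decimal_point(void)
{
    return localeconv()->decimal_point;
}
```

Under the definitions in Example 7-8, format_number() would be expected to produce "3,14" for the value 3.14159. Note that this conversion does not apply thousands_sep or grouping.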
Refer to the locale(4) reference page for detailed rules about symbol
definitions.
7.2.6 Defining the LC_TIME Locale Category
The LC_TIME section defines the interpretation of field descriptors supported by
the date command. This category section also affects the behavior
of the strftime(), wcsftime(), strptime(), and nl_langinfo() functions. Example 7-9 shows some of the symbols defined for the sample German
locale.
Example 7-9: LC_TIME Category Definition
LC_TIME (1)
abday "<S><o>";"<M><o>";"<D><i>";"<M><i>";"<D><o>";\
"<F><r>";"<S><a>" (2)
day "<S><o><n><n><t><a><g>";"<M><o><n><t><a><g>";\
"<D><i><e><n><s><t><a><g>";\
"<M><i><t><t><w><o><c><h>";\
"<D><o><n><n><e><r><s><t><a><g>";\
"<F><r><e><i><t><a><g>";"<S><a><m><s><t><a><g>" (3)
abmon "<J><a><n>";"<F><e><b>";"<M><a:><r>";\
"<A><p><r>";"<M><a><i>";"<J><u><n>";\
"<J><u><l>";"<A><u><g>";"<S><e><p>";\
"<O><k><t>";"<N><o><v>";"<D><e><z>" (4)
mon "<J><a><n><u><a><r>";"<F><e><b><r><u><a><r>";\
"<M><a:><r><z>";"<A><p><r><i><l>";"<M><a><i>";\
"<J><u><n><i>";"<J><u><l><i>";\
"<A><u><g><u><s><t>";\
"<S><e><p><t><e><m><b><e><r>";\
"<O><k><t><o><b><e><r>";\
"<N><o><v><e><m><b><e><r>";\
"<D><e><z><e><m><b><e><r>" (5)
d_t_fmt "%d.%B %Y %H:%M:%S" (6)
.
.
.
END LC_TIME (7)
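The effect of these definitions can be observed with strftime(). The following sketch passes the d_t_fmt string from Example 7-9 explicitly so that the field descriptors are visible; an application would more commonly use the %c descriptor, which selects the locale's d_t_fmt value. The function name format_timestamp() is hypothetical.

```c
#include <locale.h>
#include <time.h>

/* Hypothetical helper: formats a broken-down time using the same
 * descriptors as the d_t_fmt definition in Example 7-9. The month
 * name for %B comes from the "mon" list of the current locale.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t format_timestamp(char *buf, size_t size, const struct tm *when)
{
    return strftime(buf, size, "%d.%B %Y %H:%M:%S", when);
}
```

For 3 October 1990 at 12:30:45, the C locale yields "03.October 1990 12:30:45". Under the sample German locale, the month would be expected to appear as "Oktober", following the mon definition.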
am_pm "<A><M>";"<P><M>"
Refer to the locale(4) reference page
for more complete information.
7.3 Building Libraries to Convert Multibyte/Wide-Character Encodings
C library routines
rely on a set of special interfaces to convert characters to and from data
file encoding and wide-character encoding (internal process code). By default,
the C library routines use interfaces that handle only single-byte characters.
However, many are defined with entry points that permit use of alternative
interfaces for handling multibyte characters. The interfaces that can be tailored
to a locale's codeset are called methods.
7.3.1 Required Methods
If your locale uses methods, it must
supply the following methods; without these methods, it is impossible for
C Library functions to convert data between multibyte and wide-character formats:
7.3.1.1 Writing the __mbstopcs Method for the fgetws Function
The fgetws() function
uses the __mbstopcs method to convert the bytes in the
standard I/O (stdio) buffer to a wide-character string. The function
that implements this method must return the number of wide characters converted
by the call.
Example 7-10: The __mbstopcs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h> (1)
#include <sys/localedef.h> (1)
int __mbstopcs_sdeckanji(
wchar_t *pwcs, (2)
size_t pwcs_len, (3)
const char *s, (4)
size_t s_len, (5)
int stopchr, (6)
char **endptr, (7)
int *err, (8)
_LC_charmap_t *handle ) (9)
{
int cnt = 0; (10)
int pwcs_cnt = 0; (10)
int s_cnt = 0; (10)
*err = 0; (11)
while (1) { (12)
if (pwcs_cnt >= pwcs_len || s_cnt >= s_len) {
*endptr = (char *)&(s[s_cnt]);
break;
} (13)
if ((cnt = __mbtopc_sdeckanji(&(pwcs[pwcs_cnt]),
&(s[s_cnt]), (s_len - s_cnt), err)) == 0) {
*endptr = (char *)&(s[s_cnt]);
break;
} (14)
pwcs_cnt++; (15)
if ( s[s_cnt] == (char) stopchr) {
*endptr = (char *)&(s[s_cnt+1]);
break;
} (16)
s_cnt += cnt; (17)
} (18)
return (pwcs_cnt); (19)
}
7.3.1.2 Writing the __mbtopc Method for the getwc() Function
The getwc() or fgetwc() function calls the __mbtopc method
to convert a multibyte character to a wide character. The method returns the
number of bytes in the multibyte character that is converted. This method
is similar to the one for mbtowc (see Section 7.3.1.7)
but contains an additional parameter that getwc() needs.
By convention, a C source file for this method has the file name __mbtopc_codeset.c, where codeset identifies the
codeset for which this method is tailored. Example 7-11
shows the file __mbtopc_sdeckanji.c that defines the __mbtopc method used with the ja_JP.sdeckanji locale.
Example 7-11: The __mbtopc_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
s[0] < 0x9f: PC = s[0]
s[0] = 0x8e: PC = s[1] + 0x5f;
s[0] = 0x8f PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
s[0] > 0xa1:0xa1 < s[1] < 0xfe
PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
0x21 < s[1] < 0x7e
PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ (2)
int __mbtopc_sdeckanji(
wchar_t *pwc, (3)
char *ts, (4)
size_t maxlen, (5)
int *err, (6)
_LC_charmap_t *handle ) (7)
{
wchar_t dummy; (8)
unsigned char *s = (unsigned char *)ts; (9)
if (s == NULL)
return(0); (10)
if (pwc == (wchar_t *)NULL)
pwc = &dummy; (11)
*err = 0; (12)
if (s[0] <= 0x8d) {
if (maxlen < 1) {
*err = 1;
return(0);
}
else {
*pwc = (wchar_t) s[0];
return(1);
}
} (13)
else if (s[0] == 0x8e) {
if (maxlen >= 2) {
if (s[1] >=0xa1 && s[1] <=0xfe) {
*pwc = (wchar_t) (s[1] + 0x5f);
return(2);
}
}
else {
*err = 2;
return(0);
}
} (14)
else if (s[0] == 0x8f) {
if (maxlen >= 3) {
if ((s[1] >=0xa1 && s[1] <=0xfe) &&
(s[2] >=0xa1 && s[2] <= 0xfe)) {
*pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
(wchar_t) (s[2] - 0xa1)) + 0x303c;
return(3);
}
}
else {
*err = 3;
return(0);
}
} (15)
else if (s[0] <= 0x9f) {
if (maxlen < 1) {
*err = 1;
return(0);
}
else {
*pwc = (wchar_t) s[0];
return(1);
}
} (16)
else if (s[0] >= 0xa1 && s[0] <= 0xfe) {
if (maxlen >= 2) {
if (s[1] >=0xa1 && s[1] <= 0xfe) {
*pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
(wchar_t) (s[1] - 0xa1)) + 0x15e;
return(2);
} else if (s[1] >=0x21 && s[1] <= 0x7e) {
*pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
(wchar_t) (s[1] - 0x21)) + 0x5f1a;
return(2);
}
}
else {
*err = 2;
return(0);
}
} (17)
*err = -1;
return(0); (18)
}
7.3.1.3 Writing the __pcstombs Method for the fputws() Function
The fputws() function
first calls the __pcstombs method to convert a string of
characters from process (wide-character) code to multibyte code. If this method
returns -1 to indicate no support by the locale, fputws() then calls putwc() for each wide character in the
string being converted. By convention, a C source file for this method has
the file name __pcstombs_codeset.c,
where codeset identifies the codeset for which this method is tailored. Example 7-12 shows the file __pcstombs_sdeckanji.c that defines the __pcstombs method used with the ja_JP.sdeckanji locale.
Example 7-12: The __pcstombs_sdeckanji Method for the ja_JP.sdeckanji Locale
int __pcstombs_sdeckanji()
{
return -1; (1)
}
#include <stdlib.h>
#include <wchar.h>
#include <sys/localedef.h>
int __pcstombs_newcodeset(
wchar_t *pcsbuf, (1)
size_t pcsbuf_len, (2)
char *mbsbuf, (3)
size_t mbsbuf_len, (4)
char **endptr, (5)
int *err, (6)
_LC_charmap_t *handle ) (7)
7.3.1.4 Writing a __pctomb Method
C Library functions currently do not use
the __pctomb interface. The putwc()
function, for example, calls the wctomb method to convert a character
from wide-character to multibyte-character format. Nonetheless, the localedef command requires a method for this function when your locale
supplies methods. By convention, a C source file for this method has the file
name __pctomb_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-13 shows the file __pctomb_sdeckanji.c that defines the __pctomb method used with the ja_JP.sdeckanji locale.
Example 7-13: The __pctomb_sdeckanji Method for the ja_JP.sdeckanji Locale
int __pctomb_sdeckanji()
{
return -1; (1)
}
7.3.1.5 Writing a Method for the mblen Function
The mblen() function
uses the mblen method to return the number of bytes in a multibyte
character. By convention, a C source file for this method has the file name __mblen_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-14
shows the file __mblen_sdeckanji.c that defines the mblen method used with the ja_JP.sdeckanji locale.
Example 7-14: The __mblen_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
s[0] < 0x9f: 1 byte
s[0] = 0x8e: 2 bytes
s[0] = 0x8f 3 bytes
s[0] > 0xa1 2 bytes
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ (2)
int __mblen_sdeckanji(
char *fs, (3)
size_t maxlen, (4)
_LC_charmap_t *handle ) (5)
{
const unsigned char *s = (void *) fs; (6)
if (s == NULL || *s == '\0')
return(0); (7)
if (maxlen < 1) {
_Seterrno(EILSEQ);
return((size_t)-1);
} (8)
if (s[0] <= 0x8d)
return(1); (9)
else if (s[0] == 0x8e) {
if (maxlen >= 2 && s[1] >=0xa1 && s[1] <=0xfe)
return(2);
} (10)
else if (s[0] == 0x8f) {
if(maxlen >=3 && (s[1] >=0xa1 && s[1] <=0xfe) &&
(s[2] >=0xa1 && s[2] <= 0xfe))
return(3);
} (11)
else if (s[0] <= 0x9f)
return(1); (12)
else if (s[0] >= 0xa1) {
if (maxlen >=2 && (s[0] <= 0xfe) )
if ( (s[1] >=0xa1 && s[1] <= 0xfe) ||
(s[1] >=0x21 && s[1] <= 0x7e) )
return(2);
} (13)
_Seterrno(EILSEQ);
return((size_t)-1); (14)
}
7.3.1.6 Writing a Method for the mbstowcs Function
The mbstowcs() function
uses the mbstowcs method to convert a multibyte character string
to process (wide-character) code and to return the number of resultant wide
characters. By convention, a C source file for this method has the file name __mbstowcs_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-15
shows the file __mbstowcs_sdeckanji.c that defines the mbstowcs method used with the ja_JP.sdeckanji locale.
Example 7-15: The __mbstowcs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h>
#include <sys/localedef.h>
size_t __mbstowcs_sdeckanji(
wchar_t *pwcs, (2)
const char *s, (3)
size_t n, (4)
_LC_charmap_t *handle ) (5)
{
int len = n; (6)
int rc; (7)
int cnt; (8)
wchar_t *pwcs0 = pwcs; (9)
int mb_cur_max; (10)
if (s == NULL)
return (0); (11)
mb_cur_max = MB_CUR_MAX; (12)
if (pwcs == (wchar_t *)NULL) {
cnt = 0;
while (*s != '\0') {
if ((rc = __mblen_sdeckanji(s, mb_cur_max, handle)) == -1)
return(-1);
cnt++ ;
s += rc;
}
return(cnt);
} (13)
while (len-- > 0) {
if ( *s == '\0') {
*pwcs = (wchar_t) '\0';
return (pwcs - pwcs0);
}
if ((cnt = __mbtowc_sdeckanji(pwcs, s, mb_cur_max, handle)) < 0)
return(-1);
s += cnt;
++pwcs;
} (14)
return (n); (15)
}
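From the application side, the two branches of this method correspond to two common uses of mbstowcs(): a call with a null destination to count characters, and a call with a buffer to convert them. The following sketch shows that calling pattern; the function name to_wide() is hypothetical, and the null-destination call is the behavior that the method above implements when pwcs is NULL.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a multibyte string to a newly
 * allocated wide-character string. Returns NULL if the input holds
 * an invalid multibyte sequence or if memory cannot be allocated.
 * The caller frees the result. */
wchar_t *to_wide(const char *s)
{
    /* First pass: null destination, so only the count is returned. */
    size_t count = mbstowcs(NULL, s, 0);
    wchar_t *ws;

    if (count == (size_t) -1)
        return NULL;

    ws = malloc((count + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;

    /* Second pass: convert, leaving room for the terminator. */
    mbstowcs(ws, s, count + 1);
    return ws;
}
```

The application must call setlocale() beforehand so that the conversion uses the methods of the intended locale.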
7.3.1.7 Writing a Method for the mbtowc Function
The mbtowc() function uses the mbtowc method to convert a multibyte character
to a wide character and to return the number of bytes in the multibyte character
that was converted. By convention, a C source file for this method has the
file name __mbtowc_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-16 shows the file __mbtowc_sdeckanji.c that defines the mbtowc method used with the ja_JP.sdeckanji locale.
Example 7-16: The __mbtowc_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
s[0] < 0x9f: PC = s[0]
s[0] = 0x8e: PC = s[1] + 0x5f;
s[0] = 0x8f PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
s[0] > 0xa1:0xa1 < s[1] < 0xfe
PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
0x21 < s[1] < 0x7e
PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ (2)
int __mbtowc_sdeckanji(
wchar_t *pwc, (3)
const char *ts, (4)
size_t maxlen, (5)
_LC_charmap_t *handle ) (6)
{
unsigned char *s = (unsigned char *)ts; (7)
wchar_t dummy; (8)
if (s == NULL)
return(0); (9)
if (maxlen < 1) {
_Seterrno(EILSEQ);
return((size_t)-1);
} (10)
if (pwc == (wchar_t *)NULL)
pwc = &dummy; (11)
if (s[0] <= 0x8d) {
*pwc = (wchar_t) s[0];
if (s[0] != '\0')
return(1);
else
return(0);
} (12)
else if (s[0] == 0x8e) {
if ( (maxlen >= 2) && ((s[1] >=0xa1) && (s[1] <=0xfe))) {
*pwc = (wchar_t) (s[1] + 0x5f); /* 0x100 - 0xa1 */
return(2);
}
} (13)
else if (s[0] == 0x8f) {
if((maxlen >= 3) && (((s[1] >=0xa1) && (s[1] <=0xfe))
&& ((s[2] >=0xa1) && (s[2] <= 0xfe)))) {
*pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
(wchar_t) (s[2] - 0xa1)) + 0x303c;
return(3);
}
} (14)
else if (s[0] <= 0x9f) {
*pwc = (wchar_t) s[0];
if (s[0] != '\0')
return(1);
else
return(0);
} (15)
else if (((s[0] >= 0xa1) && (s[0] <= 0xfe)) && (maxlen >= 2)){
if (((s[1] >=0xa1) && (s[1] <= 0xfe))){
*pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
(wchar_t)(s[1] - 0xa1)) + 0x15e;
return(2);
} else if (((s[1] >=0x21) && (s[1] <= 0x7e))){
*pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
(wchar_t)(s[1] - 0x21)) + 0x5f1a;
return(2);
}
} (16)
_Seterrno(EILSEQ);
return(-1); (17)
}
7.3.1.8 Writing a Method for the wcstombs Function
The wcstombs() function calls the wcstombs method to convert a wide-character
string to a multibyte-character string and to return the number of bytes in
the resultant multibyte-character string. By convention, a C source file for
this method has the file name __wcstombs_codeset.c, where codeset identifies the codeset for which this method
is tailored. Example 7-17 shows the file __wcstombs_sdeckanji.c that defines the wcstombs method used with the ja_JP.sdeckanji locale.
Example 7-17: The __wcstombs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h>
#include <limits.h>
#include <sys/localedef.h>
size_t __wcstombs_sdeckanji(
char *s, (2)
const wchar_t *pwcs, (3)
size_t n, (4)
_LC_charmap_t *handle ) (5)
{
int cnt=0; (6)
int len=0; (7)
int i=0; (8)
char tmps[MB_LEN_MAX+1]; (9)
if ( s == (char *)NULL) {
cnt = 0;
while (*pwcs != (wchar_t)'\0') {
if ((len = __wctomb_sdeckanji(tmps, *pwcs)) == -1)
return(-1);
cnt += len;
pwcs++;
}
return(cnt);
} (10)
if (*pwcs == (wchar_t)'\0') {
*s = '\0';
return(0);
} (11)
while (1) { (12)
if ((len = __wctomb_sdeckanji(tmps, *pwcs)) == -1)
return(-1); (13)
else if (cnt+len > n) {
*s = '\0';
break;
} (14)
if (tmps[0] == '\0') {
*s = '\0';
break;
} (15)
for (i=0; i<len; i++) {
*s = tmps[i];
s++;
} (16)
cnt += len; (17)
if (cnt == n)
break; (18)
pwcs++; (19)
} (20)
if (cnt == 0)
cnt = len; (21)
return (cnt); (22)
}
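An application that calls wcstombs() needs a destination buffer large enough for the worst case, which is MB_CUR_MAX bytes for each wide character plus a terminating null byte. The following sketch shows one way to size and fill such a buffer; the function name to_multibyte() is hypothetical.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a wide-character string to a newly
 * allocated multibyte string. Returns NULL if a wide character has
 * no multibyte representation in the current locale or if memory
 * cannot be allocated. The caller frees the result. */
char *to_multibyte(const wchar_t *ws)
{
    /* Worst case: every character needs MB_CUR_MAX bytes. */
    size_t max_bytes = wcslen(ws) * MB_CUR_MAX + 1;
    char *s = malloc(max_bytes);

    if (s == NULL)
        return NULL;

    if (wcstombs(s, ws, max_bytes) == (size_t) -1) {
        free(s);
        return NULL;
    }
    return s;
}
```

Because the buffer always has room for the terminator, the conversion never stops on the "cnt + len > n" condition handled in the method above.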
7.3.1.9 Writing a Method for the wctomb Function
The wctomb() function
calls the wctomb method to convert a wide character to a multibyte
character and to return the number of bytes in the resultant multibyte character.
By convention, a C source file for this method has the file name __wctomb_codeset.c, where codeset identifies the
codeset for which this method is tailored. Example 7-18
shows the file __wctomb_sdeckanji.c that defines the wctomb method for the ja_JP.sdeckanji locale.
Example 7-18: The __wctomb_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
PC <= 0x009f: s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
s[1] = ((PC - 0x303c) >> 7) + 0x00a1
s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7 s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ (2)
int __wctomb_sdeckanji(
char *s, (3)
wchar_t wc, (4)
_LC_charmap_t *handle ) (5)
{
if (s == (char *)NULL)
return(0); (6)
if (wc <= 0x9f) {
s[0] = (char) wc;
return(1);
} (7)
else if ((wc >= 0x0100) && (wc <= 0x015d)) {
s[0] = 0x8e;
s[1] = wc - 0x5f;
return(2);
} (8)
else if ((wc >=0x015e) && (wc <= 0x303b)) {
s[0] = (char) (((wc - 0x015e) >> 7) + 0x00a1);
s[1] = (char) (((wc - 0x015e) & 0x007f) + 0x00a1);
return(2);
} (9)
else if ((wc >=0x303c) && (wc <= 0x5f19)) {
s[0] = 0x8f;
s[1] = (char) (((wc - 0x303c) >> 7) + 0x00a1);
s[2] = (char) (((wc - 0x303c) & 0x007f) + 0x00a1);
return(3);
} (10)
else if ((wc >=0x5f1a) && (wc <= 0x8df7)) {
s[0] = (char) (((wc - 0x5f1a) >> 7) + 0x00a1);
s[1] = (char) (((wc - 0x5f1a) & 0x007f) + 0x0021);
return(2);
} (11)
_Seterrno(EILSEQ);
return(-1); (12)
}
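The arithmetic in the JIS X0208 branch can be checked by hand. For the byte pair 0xb0 0xa1, the process code is (((0xb0 - 0xa1) << 7) | (0xa1 - 0xa1)) + 0x15e = 0x780 + 0x15e = 0x8de, and applying the reverse formulas to 0x8de yields 0xb0 and 0xa1 again. The following standalone sketch reproduces only that branch so the round trip can be tested in isolation; the function names are hypothetical, and the code does not use the _LC_charmap_t handle or any other internal interface.

```c
/* Hypothetical helper: packs a two-byte JIS X0208 character (both
 * bytes in the range 0xa1-0xfe) into a process code, using the same
 * formula as the multibyte-to-wide methods in this chapter. */
unsigned int x0208_to_pc(unsigned char b0, unsigned char b1)
{
    return (((unsigned int) (b0 - 0xa1) << 7) |
            (unsigned int) (b1 - 0xa1)) + 0x15e;
}

/* Hypothetical helper: unpacks a process code in the range
 * 0x015e-0x303b into its two bytes, using the same formula as the
 * wctomb method. */
void pc_to_x0208(unsigned int pc, unsigned char out[2])
{
    out[0] = (unsigned char) (((pc - 0x15e) >> 7) + 0xa1);
    out[1] = (unsigned char) (((pc - 0x15e) & 0x7f) + 0xa1);
}
```

The end points agree with the table in the method comments: the pair 0xa1 0xa1 maps to 0x015e and the pair 0xfe 0xfe maps to 0x303b.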
7.3.1.10 Writing a Method for the wcswidth Function
The wcswidth() function uses the wcswidth method to determine the number
of columns required to display a wide-character string. By convention, a C
source file for this method has the file name __wcswidth_codeset.c, where codeset identifies the
codeset for which this method is tailored. Example 7-19
shows the file __wcswidth_sdeckanji.c that defines the wcswidth method used for the ja_JP.sdeckanji locale.
Example 7-19: The __wcswidth_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
PC <= 0x009f: s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
s[1] = ((PC - 0x303c) >> 7) + 0x00a1
s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7 s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ (2)
int __wcswidth_sdeckanji(
const wchar_t *wcs, (3)
size_t n, (4)
_LC_charmap_t *hdl ) (5)
{
int len; (6)
int i; (7)
if (wcs == (wchar_t *)NULL || *wcs == (wchar_t)NULL)
return(0); (8)
len = 0; (9)
for (i=0; wcs[i] != (wchar_t)NULL && i<n; i++) { (10)
if (wcs[i] <= 0x9f)
len += 1; (11)
else if ((wcs[i] >= 0x0100) && (wcs[i] <= 0x015d))
len += 1; (12)
else if ((wcs[i] >=0x015e) && (wcs[i] <= 0x303b))
len += 2; (13)
else if ((wcs[i] >=0x303c) && (wcs[i] <= 0x5f19))
len += 2; (14)
else if ((wcs[i] >=0x5f1a) && (wcs[i] <= 0x8df7))
len += 2; (15)
else
return(-1); (16)
} (17)
return(len); (18)
}
7.3.1.11 Writing a Method for the wcwidth Function
The wcwidth() function
uses the wcwidth method to determine the number of columns required
to display a wide character. By convention, a C source file for this method
has the file name __wcwidth_codeset.c, where codeset identifies the codeset for which this method
is tailored. Example 7-20 shows the file __wcwidth_sdeckanji.c that defines the wcwidth method used with the ja_JP.sdeckanji locale.
Example 7-20: The __wcwidth_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h> (1)
#include <wchar.h>
#include <sys/localedef.h>
/*
The algorithm for this conversion is:
PC <= 0x009f: s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
s[1] = ((PC - 0x303c) >> 7) + 0x00a1
s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7 s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021
+-----------------+-----------+-----------+-----------+
| process code | s[0] | s[1] | s[2] |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f | -- | -- |
| 0x00a0 - 0x00ff | -- | -- | -- |
| 0x0100 - 0x015d | 0x8e | 0xa1-0xfe | -- | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe | -- | JIS X0208
| 0x303c - 0x5f19 | 0x8f | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe | -- | UDC
+-----------------+-----------+-----------+-----------+
*/ (2)
int __wcwidth_sdeckanji(
wint_t wc, (3)
_LC_charmap_t *hdl ) (4)
{
if (wc == 0)
return(0); (5)
if (wc <= 0x9f)
return(1); (6)
else if ((wc >= 0x0100) && (wc <= 0x015d))
return(1); (7)
else if ((wc >=0x015e) && (wc <= 0x303b))
return(2); (8)
else if ((wc >=0x303c) && (wc <= 0x5f19))
return(2); (9)
else if ((wc >=0x5f1a) && (wc <= 0x8df7))
return(2); (10)
return(-1); (11)
}
7.3.2 Optional Methods
A locale can include methods in addition
to those discussed in Section 7.3.1. If your locale uses
methods but does not supply any for the functions associated with particular
locale categories or some other locale-related functions, the localedef command applies default methods that handle process code for both single-byte
and multibyte characters. The following list names the optional methods:
7.3.3 Building a Shareable Library to Use with a Locale
Example 7-21: Building a Library of Methods Used with the ja_JP.sdeckanji Locale
cc -std0 -c \
    __mblen_sdeckanji.c __mbstopcs_sdeckanji.c \
    __mbstowcs_sdeckanji.c __mbtopc_sdeckanji.c \
    __mbtowc_sdeckanji.c __pcstombs_sdeckanji.c \
    __pctomb_sdeckanji.c __wcstombs_sdeckanji.c \
    __wcswidth_sdeckanji.c __wctomb_sdeckanji.c \
    __wcwidth_sdeckanji.c
ld -shared -set_version osf.1 -soname libsdeckanji.so \
    -no_archive -o libsdeckanji.so \
    __mblen_sdeckanji.o __mbstopcs_sdeckanji.o \
    __mbstowcs_sdeckanji.o __mbtopc_sdeckanji.o \
    __mbtowc_sdeckanji.o __pcstombs_sdeckanji.o __pctomb_sdeckanji.o \
    __wcstombs_sdeckanji.o __wcswidth_sdeckanji.o __wctomb_sdeckanji.o \
    __wcwidth_sdeckanji.o \
    -lc
Refer to the cc(1) and ld(1) reference pages for more information
about the cc and ld commands and how you build shared
libraries.
7.3.4 Creating a methods File for a Locale
The methods file contains an entry for each function that is defined in
the methods shared library for use with the locale. The operation performed
by the function is identified by a method keyword, followed by quoted strings
with the name of the function and the path to the shared library that contains
the function.
Example 7-22: The methods File for the ja_JP.sdeckanji Locale
# sdeckanji.m (1)
# <method_keyword> "<entry>" "<package>" "<library_path>" (1)
METHODS (2)
__mbstopcs "__mbstopcs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
__mbtopc "__mbtopc_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
__pcstombs "__pcstombs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
__pctomb "__pctomb_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
mblen "__mblen_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
mbstowcs "__mbstowcs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
mbtowc "__mbtowc_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
wcstombs "__wcstombs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
wcswidth "__wcswidth_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
wctomb "__wctomb_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
wcwidth "__wcwidth_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so" (3)
END METHODS (4)
Refer to the localedef(1) reference page for detailed information
about methods file entries.
7.4 Building and Testing the Locale
Use the localedef command
to build a locale from its source files. Example 7-23
shows the command line needed to build the German locale used in most examples
in this chapter. Assume for this example that all source files reside in the
user's default directory and that the resulting locale is also created in
that directory.
Example 7-23: Building the de_DE.ISO8859-1@example Locale
% localedef -f ISO8859-1.cmap \ (1)
-i de_DE.ISO8859-1.lscr \ (2)
de_DE.ISO8859-1@example (3)
The localedef(1) reference page gives a complete description of the
command. The following is a summary of some important rules and options:
Example 7-24: Setting the LOCPATH Variable and Testing a Locale
% setenv LOCPATH ~harry/locales
% setenv LANG de_DE.ISO8859-1@example
% date
12.Dezember 1993 09:18:11
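A locale can also be checked from a program rather than from the shell. The following is a minimal sketch that uses only the standard setlocale(), nl_langinfo(), and strftime() interfaces. The function name report_locale is hypothetical, and the "%c" format is simply one convenient way to exercise the LC_TIME category; it is not the format that the date command uses.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: bind to the locale called name, then print the
 * LC_TIME setting, the codeset name, and the current date and time as
 * formatted by that locale. Pass "" to use LANG and the LC_*
 * environment variables. Returns 0 on success, or -1 if the locale
 * cannot be loaded or the time cannot be read.
 */
int report_locale(FILE *out, const char *name)
{
    char stamp[128];
    time_t now = time(NULL);
    struct tm *tm;

    if (setlocale(LC_ALL, name) == NULL)
        return -1;                      /* locale not found */

    tm = localtime(&now);
    if (tm == NULL)
        return -1;

    fprintf(out, "LC_TIME: %s\n", setlocale(LC_TIME, NULL));
    fprintf(out, "codeset: %s\n", nl_langinfo(CODESET));
    if (strftime(stamp, sizeof stamp, "%c", tm) > 0)
        fprintf(out, "date:    %s\n", stamp);
    return 0;
}
```

With LOCPATH and LANG set as in Example 7-24, calling report_locale(stdout, "") should report the <code_set_name> value from the charmap file on the codeset line. A return value of -1 usually means that the locale name or the LOCPATH setting is wrong.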
Note
% setenv LANG de_DE.ISO8859-1
% setenv LC_CTYPE de_DE.ISO8859-1@example
% setenv LC_COLLATE de_DE.ISO8859-1@example
.
.
.
% setenv LC_TIME de_DE.ISO8859-1@example