The charmap(4) reference page explains the format and rules for this file. This chapter includes a charmap example that conforms to the binary character encodings specified for the ISO Latin-1 codeset, which defines all characters as single 8-bit bytes. The chapter also includes an example that shows part of a charmap file for the SJIS codeset, which defines both single-byte and multibyte characters.
The locale(4) reference page explains the rules and format for this file. This chapter develops a locale named de_DE.ISO8859-1@example that supports the language and customs of Germany.
These files are required when the charmap file defines multibyte characters; otherwise, the files are optional. The methods file specifies the shareable library that contains redefinitions of the C Library interfaces that convert data to and from internal process (wide-character) encoding.
# Map file providing symbols for characters whose binary   (1)
# encodings are specified in the ISO Latin-1 codeset.      (1)
<code_set_name>     "ISO8859-1"   (2)
<mb_cur_max>        1             (2)
<mb_cur_min>        1             (2)
<escape_char>       \             (2)
<comment_char>      #             (2)
CHARMAP                           (3)
<NU>                \d000         (4)
<SH>                \d001
<SX>                \d002
<EX>                \d003
<ET>                \d004
<EQ>                \d005
<AK>                \d006
<BL>                \d007
<BS>                \d008
.
.
.
<0>                 \d048         (4)
<1>                 \d049
<2>                 \d050
<3>                 \d051
.
.
.
<A>                 \d065         (4)
<B>                 \d066
<C>                 \d067
<D>                 \d068
<E>                 \d069
.
.
.
<X>                 \d088         (4)
<Y>                 \d089
<Z>                 \d090
<<(>                \d091
<//>                \d092
<)\>>               \d093
<'\>>               \d094
<_>                 \d095
<'!>                \d096
<a>                 \d097
<b>                 \d098
<c>                 \d099
<d>                 \d100
<e>                 \d101
.
.
.
<x>                 \d120         (4)
<y>                 \d121
<z>                 \d122
<(!>                \d123
<!!>                \d124
<!)>                \d125
<'?>                \d126
<DT>                \d127
.
.
.
<O:>                \d214         (4)
<U:>                \d220
.
.
.
<ss>                \d223         (4)
.
.
.
<o:>                \d246         (4)
.
.
.
<u:>                \d252         (4)
.
.
.
<backspace>         \d008         (5)
<tab>               \d009
<newline>           \d010
<vertical-tab>      \d011
<form-feed>         \d012
<carriage-return>   \d013
.
.
.
<space>             \d032         (5)
<exclamation-mark>  \d033
<quotation-mark>    \d034
<number-sign>       \d035
<dollar-sign>       \d036
END CHARMAP                       (6)
By default, the comment character is the number sign (#). You can override this default with a <comment_char> definition (see 2).
This example provides entries for all valid declarations and specifies default values for all but <code_set_name>. Usually, you specify a declaration only when you want to override its default value. In this example, the declarations for <comment_char> and <escape_char> specify the default values for the comment character and escape character, respectively. The value for <mb_cur_max>, the maximum length (in bytes) of a character, is 1 for this particular locale. The value for <mb_cur_min>, the minimum length (in bytes) of a character, must be 1 in all locales. (All locales include characters in the Portable Character Set, which defines single-byte characters.)
The <code_set_name> value will be the value returned on the nl_langinfo(CODESET) call made by applications that bind to the locale at run time.
Each character map entry consists of a symbolic name and an encoding. The name and encoding are separated by one or more spaces.
An encoding can be one or more decimal, octal, or hexadecimal constants. (Multiple constants apply to multibyte encodings.) The constants have the following formats:
You can create multiple symbolic names for the same character (encoding). In this source file, for example, the backspace character (value \d008) has two symbolic names, <BS> and <backspace>. When more than one symbolic name exists for a character, you can specify any of them in locale definition source files to refer to the character.
The source files for codesets with multibyte characters have more complex character maps. Example 7-2 shows a subset of character map entries from a source file for the Japanese SJIS codeset. This source file specifies entries from several character sets that must be supported within the same codeset.
# SJIS charmap
#
<code_set_name>   "SJIS"   (1)
<mb_cur_min>      1        (2)
<mb_cur_max>      2        (3)
CHARMAP
#
# CS0: ASCII
#
.
.
.
<commercial-at>   \x40     (4)
<A>               \x41     (4)
<B>               \x42     (4)
.
.
.
#
# CS1: JIS X0208-1983 for ShiftJIS.
#
<zenkaku-space>     \x81\x40   (5)
<j0101>...<j0163>   \x81\x40   (5)
<j0164>...<j0194>   \x81\x80   (5)
.
.
.
#
# UDC Area in JIS X0208 plane
#
<u8501>...<u8563>   \xeb\x40   (6)
<u8564>...<u8594>   \xeb\x80   (6)
<u8601>...<u8663>   \xeb\x9f   (6)
.
.
.
#
# CS2: JIS X0201 (so-called Hankaku-Kana)
#
<kana-fullstop>     \xa1   (7)
.
.
.
<kana-conjunctive>  \xa5   (7)
<kana-WO>           \xa6   (7)
<kana-a>            \xa7   (7)
.
.
.
END CHARMAP
This value must be 1.
In SJIS, the largest multibyte character is 2 bytes in length.
Note how character symbols are specified as a range and how two hexadecimal values determine the encoding for a 2-byte character.
When symbols are specified as a range of symbol values, the specified character encoding applies to the first symbol in the range. The localedef command automatically increments both the symbol value and the encoding value to create symbols and encodings for all characters in the range.
These maps establish ranges of encodings for which users can later define characters.
The symbolic names for characters in character map source files are in the process of becoming standardized. A future revision of the X/Open UNIX standard will likely specify both long and short symbolic names for characters.
Note
The symbolic names for characters shown in this example are not necessarily the names being proposed for adoption by any standards group.
# comment-line                    (1)
comment_char <char_symbol1>       (2)
escape_char <char_symbol2>        (3)
CATEGORY_NAME                     (4)
category_definition-statement     (5)
category_definition-statement     (5)
.
.
.
END CATEGORY_NAME (6)
.
.
.
(7)
The number sign (#) is the default comment character. You can specify comments as entire lines by entering the comment character in the first column of the line. You cannot specify comments on the same lines as definition statements in locale source files. In this respect, locale source files differ from character map source files.
You can override the default comment character with an entry line that begins with the comment_char keyword, followed by the symbol for the desired character. The character symbol is defined in the character map (charmap) source file for the locale.
The escape character, by default the backslash (\), is used in decimal, hexadecimal, and octal constants and to indicate when definition statements are continued to the next line of the source file. You can override the default escape character with an entry line that begins with the escape_char keyword, followed by one or more blank characters, then the symbol for the desired character. The character symbol is defined in the character map source file for the locale.
Section headers correspond to category names, which are LC_CTYPE, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_MESSAGES, and LC_TIME.
The format of these statements varies from one category to the next. In general, a statement begins with a keyword, followed by one or more spaces or tabs, then the definition itself.
Section trailers start with the keyword END, followed by the category name.
LC_CTYPE   (1)
upper   <A>;<A:>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;<N>;<O>;\
        <O:>;<P>;<Q>;<R>;<S>;<T>;<U>;<U:>;<V>;<W>;<X>;<Y>;<Z>   (2)
lower   <a>;<a:>;<b>;<c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;<n>;<o>;\
        <o:>;<p>;<q>;<r>;<s>;<ss>;<t>;<u>;<u:>;<v>;<w>;<x>;<y>;<z>   (2)
alpha   <A>;<A:>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;<N>;<O>;\
        <O:>;<P>;<Q>;<R>;<S>;<T>;<U>;<U:>;<V>;<W>;<X>;<Y>;<Z>;<a>;<a:>;<b>;\
        <c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;<n>;<o>;<o:>;<p>;<q>;<r>;\
        <s>;<ss>;<t>;<u>;<u:>;<v>;<w>;<x>;<y>;<z>   (2)
space   <tab>;<newline>;<vertical-tab>;<form-feed>;<carriage-return>;<space>;\
        <NS>   (2)
cntrl   <NUL>;...;<IS1>;<DEL>;...;<AC>   (2)
.
.
.
toupper (<a>,<A>);(<a:>,<A:>);(<b>,<B>);(<c>,<C>);(<d>,<D>);(<e>,<E>);\
        (<f>,<F>);(<g>,<G>);(<h>,<H>);(<i>,<I>);(<j>,<J>);(<k>,<K>);\
        (<l>,<L>);(<m>,<M>);(<n>,<N>);(<o>,<O>);(<o:>,<O:>);(<p>,<P>);\
        (<q>,<Q>);(<r>,<R>);(<s>,<S>);(<t>,<T>);(<u>,<U>);(<u:>,<U:>);\
        (<v>,<V>);(<w>,<W>);(<x>,<X>);(<y>,<Y>);(<z>,<Z>)   (3)
.
.
.
END LC_CTYPE (4)
These definitions start with a keyword that stands for the character class, followed by one or more blank characters, then a list of symbols for all characters in that class. You can substitute the character's encoding for its symbol; however, specifying characters by their encodings diminishes the readability of the locale source file and makes it impossible to use the file with more than one codeset.
As shown in the definition of the cntrl class, you can specify a horizontal ellipsis (...) to represent a range of characters. In the string <NUL>;...;<IS1>, for example, the ellipsis represents all characters whose encodings are between the character whose symbol is <NUL> and the character whose symbol is <IS1>. The symbols and their encodings are specified in the charmap file for the locale.
The standard character classes are represented by the following keywords:
From the application standpoint, there is also the class alnum. This class is not defined in a locale; it is by definition a combination of characters in the alpha and digit classes.
These definitions, which begin with the keywords toupper and tolower, list symbols in pairs rather than individually. In the toupper definition shown here, the first symbol in the pair is the symbol for a lowercase letter and the second symbol is the symbol for that letter's uppercase equivalent. This definition determines what a letter is converted to when functions perform case conversion on text data.
When you use a copy statement, it must be the only entry between the section header and trailer.
Character classification is language specific. Therefore, the standard character classes may not apply to all languages. Define for a locale only the standard character classes that are appropriate for the locale's language. Depending on the language, it may be necessary to define nonstandardized classes.
A definition for a nonstandardized character class must be preceded by the charclass statement to define a keyword for the class, followed by the class definition. For example:
charclass vowel
vowel <a>;<e>;<i>;<o>;<u>;<y>
LC_COLLATE                               (1)
order_start forward;forward;backward     (2)
.
.
.
<o> <o>;<o>;<o> (3)
.
.
.
<o:> <o>;<o>;<o:> (3)
.
.
.
<O> <o>;<O>;<O> (3)
.
.
.
<O:> <o>;<O>;<O:> (3)
.
.
.
<Z> <z>;<Z>;<Z> (3)
.
.
.
UNDEFINED IGNORE;IGNORE;IGNORE   (4)
order_end                        (5)
END LC_COLLATE                   (6)
Following the order_start keyword on the same line are sort directives, separated by semicolons (;) that apply to each order. Sort directives can include the following keywords.
When a sort directive includes two keywords, the position keyword combined with either forward or backward, the two keywords are separated by a comma (,). The position keyword by itself is equivalent to the directive forward,position.
The number of sort directives corresponds to the number of weights each collating element is assigned in subsequent statements.
Each sort directive and its associated set of weights specify information for one pass, or level, of string comparison. The first directive applies when the string comparison operation applies the primary weight, the second when the string comparison operation applies the secondary weight, and so on. The number of levels required to collate strings correctly depends on language and cultural requirements and therefore varies from one locale to another. There is also a level number maximum, associated with the COLL_WEIGHTS_MAX setting in the limits.h and sys/localedef.h files. On Digital UNIX systems, you are limited to six collation levels (sort directives).
If you do not specify a sort directive, the default is forward.
These statements specify a character symbol, followed by one or more blank characters (spaces or tabs), then the symbols for characters that have the same weight at each stage of the sort. For example, the lowercase character o, lowercase character o umlaut, uppercase character O, and uppercase character O umlaut, whose symbols are <o>, <o:>, <O>, and <O:>, respectively, are grouped together (have the same weight) at the first sort level. At the secondary sort level, lowercase o is grouped with lowercase o umlaut and uppercase O is grouped with uppercase O umlaut. The four characters have distinct weights at the tertiary sort level.
The UNDEFINED keyword begins a collation order statement to be applied to all characters that are defined in the locale's charmap file but not specified in other collation order statements. This statement indicates that such characters are to be ignored during collation for all weight comparisons.
You should include a collation order statement that begins with the UNDEFINED keyword. If this statement is absent, the localedef command includes undefined characters at the end of the collating order and issues a warning.
Furthermore, if you place an UNDEFINED statement as the last collation order statement, the localedef command can sometimes compress all undefined characters into one entry. This action can reduce the size of the locale.
A copy statement can be the only entry between the section header and trailer.
In such cases, you first specify collating-element statements before the order_start statement to define symbols for the strings. You can then specify those symbols in collating order statements. For example:
collating-element <ch> from "<c><h>"
.
.
.
order_start forward;forward;backward
.
.
.
<ch> <Ch>;<ch>;<ch>
.
.
.
You must define each symbolic name by using the collating-symbol statement in the source file before the order_start statement. You then include the symbol in the appropriate position in the list of collation order statements for collating elements. For example, if you wanted the symbol <LOW> to represent the lowest position in the collating order, <LOW> would be the line entry immediately following the order_start statement. A symbol such as <UPPERCASE> would be positioned on the line immediately preceding the section of collating order statements for uppercase letters.
A symbol must occur before the first collation order statement in which it is used. Therefore, you cannot define a symbol for the highest position in the collating order.
After symbols are defined and positioned, you can use them as weights in collating order statements. For example:
collating-symbol <LOWERCASE>
collating-symbol <UNACCENTED>
.
.
.
order_start forward;backward;forward;forward
.
.
.
<UNACCENTED>
.
.
.
<LOWERCASE>
<a> <a>;<UNACCENTED>;<LOWERCASE>;IGNORE
.
.
.
LC_MESSAGES                         (1)
yesexpr "^[<j><J>][[:alpha:]]*"     (2)
noexpr  "^[<n><N>][[:alpha:]]*"     (3)
yesstr  "<j>"                       (4)
nostr   "<n>"                       (5)
END LC_MESSAGES                     (6)
This entry consists of the yesexpr keyword, followed by one or more spaces or tabs, and an extended regular expression that is delimited by double quotation marks.
This entry consists of the noexpr keyword, followed by one or more spaces or tabs, and an extended regular expression that is delimited by double quotation marks.
This entry consists of the yesstr keyword, followed by one or more spaces or tabs, and a string that is delimited by double quotation marks.
This entry consists of the nostr keyword, followed by one or more spaces or tabs, and a string that is delimited by double quotation marks.
LC_MONETARY                     (1)
int_curr_symbol    "<D><M>"     (2)
currency_symbol    "<D><M>"     (2)
mon_decimal_point  "<,>"        (2)
mon_thousands_sep  "<.>"        (2)
mon_grouping       3            (2)
positive_sign      ""           (2)
negative_sign      "<->"        (2)
.
.
.
END LC_MONETARY (3)
The entries in the example specify the following:
The following list describes all the symbol names you can define in the LC_MONETARY section:
The international currency symbol
The local currency symbol
The radix character, or decimal point, used in monetary formats
The character used to separate groups of digits to the left of the radix character
The size of each group of digits to the left of the radix character
The string indicating that a monetary value is nonnegative
The string indicating that a monetary value is negative
The number of digits to be written to the right of the radix character when int_curr_symbol appears in the format
The number of digits to be written to the right of the radix character when currency_symbol appears in the format
An integer that determines if the international or local currency symbol precedes a nonnegative value
An integer that determines whether a space separates the international or local currency symbol from other parts of a formatted, nonnegative value
An integer that determines if the international or local currency symbol precedes a negative value
An integer that determines whether a space separates the international or local currency symbol from other parts of a formatted, negative value
An integer that indicates if or how the positive sign string is positioned in a nonnegative, formatted value
An integer that indicates how the negative sign string is positioned in a negative, formatted value
LC_NUMERIC                 (1)
decimal_point  "<,>"       (2)
thousands_sep  "<.>"       (3)
grouping       3           (4)
END LC_NUMERIC             (5)
The preceding example shows all of the symbols you can define in the LC_NUMERIC section. In place of any symbol definitions, you can specify a copy statement between the section header and trailer to include this section from another locale.
Refer to the locale(4) reference page for detailed rules about symbol definitions.
LC_TIME   (1)
abday   "<S><o>";"<M><o>";"<D><i>";"<M><i>";"<D><o>";\
        "<F><r>";"<S><a>"   (2)
day     "<S><o><n><n><t><a><g>";"<M><o><n><t><a><g>";\
        "<D><i><e><n><s><t><a><g>";\
        "<M><i><t><t><w><o><c><h>";\
        "<D><o><n><n><e><r><s><t><a><g>";\
        "<F><r><e><i><t><a><g>";"<S><a><m><s><t><a><g>"   (3)
abmon   "<J><a><n>";"<F><e><b>";"<M><a:><r>";\
        "<A><p><r>";"<M><a><i>";"<J><u><n>";\
        "<J><u><l>";"<A><u><g>";"<S><e><p>";\
        "<O><k><t>";"<N><o><v>";"<D><e><z>"   (4)
mon     "<J><a><n><u><a><r>";"<F><e><b><r><u><a><r>";\
        "<M><a:><r><z>";"<A><p><r><i><l>";"<M><a><i>";\
        "<J><u><n><i>";"<J><u><l><i>";\
        "<A><u><g><u><s><t>";\
        "<S><e><p><t><e><m><b><e><r>";\
        "<O><k><t><o><b><e><r>";\
        "<N><o><v><e><m><b><e><r>";\
        "<D><e><z><e><m><b><e><r>"   (5)
d_t_fmt "%d.%B %Y %H:%M:%S"   (6)
.
.
.
END LC_TIME (7)
Use the %a conversion specifier to include this string in formats.
Use the %A conversion specifier to include this string in formats.
Use the %b conversion specifier to include this string in formats.
Use the %B conversion specifier to include this string in formats.
Use this format to combine field descriptors (whose first character is the percent sign (%)) and symbols for characters. You can specify characters from the Portable Character Set (PCS), such as the period (.) and ASCII space, explicitly as characters rather than implicitly through symbols; however, use symbols to specify all other characters.
The specified format includes the field descriptors for the day of the month (%d), the full name of the month (%B), the full representation of the year (%Y), the number of hours in a 24-hour period (%H), the number of minutes (%M), and the number of seconds (%S). If the date were December 12, 1993, and the time 29 seconds after 12 o'clock in the afternoon, the format specified in this example would cause the date command to display 12.Dezember 1993 12:00:29.
The preceding example includes only some of the symbol definitions that are standard for the LC_TIME category. The following definitions are also standard:
Format for the date alone; corresponds to the %x field descriptor
Format for the time alone; corresponds to the %X field descriptor
Format for the ante meridiem and post meridiem time strings; corresponds to the %p field descriptor
For example, the definition for English would be:
am_pm "<A><M>";"<P><M>"
Format for the time according to the 12-hour clock; corresponds to the %r field descriptor
Definition of how years are counted and displayed for each era (an Asian date construct) in the locale
Format of the date alone in era notation; corresponds to the %Ex field descriptor
Format of the time alone in era notation; corresponds to the %EX field descriptor
Format of both date and time in era notation; corresponds to the %Ec field descriptor
Definition of alternative symbols for digits (used in Asian locales); corresponds to the %O field descriptor
Only locales with multibyte codesets must use methods. When a locale uses methods, there are some methods that the locale must supply and other methods that it can optionally supply. A method is required when the corresponding interface is converting characters between data formats and needs codeset-specific logic to do that operation correctly. A method is optional when the corresponding interface is working with data after it has been converted to wide-character format and can apply logic that is valid for both single-byte and multibyte characters.
Methods must be available on the system in a shareable library. This library and the functions that implement each method in the library are made known to the localedef command through a methods file. When the localedef command processes the methods file along with the charmap and locale source files, the resulting locale includes pointers to all methods that are supplied with the locale, along with pointers to default implementations for optional methods that are not supplied with the locale. When you set the LANG variable to the newly built locale and run a command or application, methods are used wherever they have been enabled in the system software.
This method is similar to the one for mbstowcs (see Section 7.3.1.6) but contains additional parameters to meet the needs of fgetws(). By convention, a C source file for this method has the file name __mbstopcs_codeset.c, where codeset identifies the codeset for which the method is tailored. Example 7-10 shows the file __mbstopcs_sdeckanji.c that defines the __mbstopcs method used with the ja_JP.sdeckanji locale.
#include <stdlib.h>                      (1)
#include <wchar.h>                       (1)
#include <sys/localedef.h>               (1)

int __mbstopcs_sdeckanji(
    wchar_t *pwcs,                       (2)
    size_t pwcs_len,                     (3)
    const char *s,                       (4)
    size_t s_len,                        (5)
    int stopchr,                         (6)
    char **endptr,                       (7)
    int *err,                            (8)
    _LC_charmap_t *handle )              (9)
{
    int cnt = 0;                         (10)
    int pwcs_cnt = 0;                    (10)
    int s_cnt = 0;                       (10)

    *err = 0;                            (11)
    while (1) {                          (12)
        if (pwcs_cnt >= pwcs_len || s_cnt >= s_len) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        }                                (13)
        if ((cnt = __mbtopc_sdeckanji(&(pwcs[pwcs_cnt]), &(s[s_cnt]),
                                      (s_len - s_cnt), err)) == 0) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        }                                (14)
        pwcs_cnt++;                      (15)
        if ( s[s_cnt] == (char) stopchr) {
            *endptr = (char *)&(s[s_cnt+1]);
            break;
        }                                (16)
        s_cnt += cnt;                    (17)
    }                                    (18)
    return (pwcs_cnt);                   (19)
}
This parameter is needed because the fgetws() function reads from the standard I/O buffer, which does not contain null-terminated strings.
This value, typically \n, is passed to the method on the call from the fgetws() function, which handles only one line of input per call.
This pointer is needed to specify the starting character in the standard I/O buffer for the next call to fgetws().
The localedef command creates and stores values in the _LC_charmap_t structure.
The err variable contains the return status of the call to the mbtopc method:
In this case, the return is the number of bytes required to form a valid character. The fgetws() function can then refill the buffer and try again.
#include <stdlib.h>                      (1)
#include <wchar.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  s[0] <= 0x9f:  PC = s[0]
  s[0] =  0x8e:  PC = s[1] + 0x5f;
  s[0] =  0x8f:  PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
  s[0] >= 0xa1:  0xa1 <= s[1] <= 0xfe
                     PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
                 0x21 <= s[1] <= 0x7e
                     PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d |   0x8e    | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 |   0x8f    | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                       (2)

int __mbtopc_sdeckanji(
    wchar_t *pwc,                        (3)
    char *ts,                            (4)
    size_t maxlen,                       (5)
    int *err,                            (6)
    _LC_charmap_t *handle )              (7)
{
    wchar_t dummy;                       (8)
    unsigned char *s = (unsigned char *)ts;   (9)

    if (s == NULL)
        return(0);                       (10)
    if (pwc == (wchar_t *)NULL)
        pwc = &dummy;                    (11)
    *err = 0;                            (12)
    if (s[0] <= 0x8d) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        }
        else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    }                                    (13)
    else if (s[0] == 0x8e) {
        if (maxlen >= 2) {
            if (s[1] >=0xa1 && s[1] <=0xfe) {
                *pwc = (wchar_t) (s[1] + 0x5f);
                return(2);
            }
        }
        else {
            *err = 2;
            return(0);
        }
    }                                    (14)
    else if (s[0] == 0x8f) {
        if (maxlen >= 3) {
            if ((s[1] >=0xa1 && s[1] <=0xfe) &&
                (s[2] >=0xa1 && s[2] <= 0xfe)) {
                *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                       (wchar_t) (s[2] - 0xa1)) + 0x303c;
                return(3);
            }
        }
        else {
            *err = 3;
            return(0);
        }
    }                                    (15)
    else if (s[0] <= 0x9f) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        }
        else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    }                                    (16)
    else if (s[0] >= 0xa1 && s[0] <= 0xfe) {
        if (maxlen >= 2) {
            if (s[1] >=0xa1 && s[1] <= 0xfe) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                       (wchar_t) (s[1] - 0xa1)) + 0x15e;
                return(2);
            }
            else if (s[1] >=0x21 && s[1] <= 0x7e) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                       (wchar_t) (s[1] - 0x21)) + 0x5f1a;
                return(2);
            }
        }
        else {
            *err = 2;
            return(0);
        }
    }                                    (17)
    *err = -1;
    return(0);                           (18)
}
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid.
This value is passed by the calling function.
This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension to values when comparing a small signed data type, such as char, to a larger signed data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in bits that remain when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e. In this case, a condition like the following one would be evaluated as true when you would expect it to be false:
if (s[0] <= 0x8d)
This operation ensures that *pwc always points to a valid address; otherwise, an application could produce a segmentation fault by referring to this pointer when a wide character has not been stored in pwc.
If s contains no characters, returns zero (0) to indicate that no bytes were converted and sets err to 1 to indicate that 1 byte is needed to form a valid character.
If the byte value is in the range being tested, moves the associated process code value to pwc and returns 1 to indicate the number of bytes converted.
If yes, moves the associated process code value to the pwc buffer and returns 2 to indicate the number of bytes converted; otherwise, returns 0 to indicate that no conversion took place and sets err to 2 to specify that at least 2 bytes are needed to form a valid character.
If yes, moves the associated process code value to pwc and returns 3 to indicate the number of bytes converted; otherwise, sets err to 3 to indicate that at least 3 bytes are needed and returns zero (0) to indicate that no character was converted.
If there are no bytes in the standard I/O buffer, returns zero (0) to indicate that no bytes were converted and sets err to 1 to indicate that at least 1 byte is needed to form a valid character.
If the byte value is in the defined range, moves the associated process code value to pwc and returns 1 to indicate the number of bytes converted.
If yes, moves the associated process code value to pwc buffer and returns 2 to indicate the number of bytes converted; otherwise, sets err to 2 to indicate that at least 2 bytes are needed to form a valid character and returns zero (0) to indicate that no bytes were converted.
These statements execute if the multibyte data in s satisfies none of the preceding if conditions.
int __pcstombs_sdeckanji()
{
    return -1;                           (1)
}
This return causes the fputws() function to use multiple calls to putwc() to convert wide characters in the string.
If you choose to implement this method fully rather than writing it to return -1, your function implementation returns the number of wide characters converted and must include header files and parameters as shown in the following example:
#include <stdlib.h>
#include <wchar.h>
#include <sys/localedef.h>

int __pcstombs_newcodeset(
    wchar_t *pcsbuf,                     (1)
    size_t pcsbuf_len,                   (2)
    char *mbsbuf,                        (3)
    size_t mbsbuf_len,                   (4)
    char **endptr,                       (5)
    int *err,                            (6)
    _LC_charmap_t *handle )              (7)
This value is passed to the method on the call from fputws().
This value is passed to the method on the call from fputws().
If this method calls the wctomb method to perform the character conversion, the wctomb method sets this status. Otherwise, this method must incorporate the logic to perform wide-character to multibyte-character conversion and set the status directly.
In any event, the fputws() function expects the following values:
In this case, the value is the number of bytes required to store the next character. The fputws() function can then empty the multibyte-character buffer and try again.
The __pcstombs method performs the reverse of the operation that the __mbstopcs method described in Section 7.3.1.3 performs. Because of the direction of the data conversion, the __pcstombs method:
int __pctomb_sdeckanji()
{
    return -1;                           (1)
}
#include <stdlib.h>                      (1)
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  s[0] <= 0x9f:  1 byte
  s[0] =  0x8e:  2 bytes
  s[0] =  0x8f:  3 bytes
  s[0] >= 0xa1:  2 bytes

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d |   0x8e    | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 |   0x8f    | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                       (2)

int __mblen_sdeckanji(
    char *fs,                            (3)
    size_t maxlen,                       (4)
    _LC_charmap_t *handle )              (5)
{
    const unsigned char *s = (void *) fs;   (6)

    if (s == NULL || *s == '\0')
        return(0);                       (7)
    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    }                                    (8)
    if (s[0] <= 0x8d)
        return(1);                       (9)
    else if (s[0] == 0x8e) {
        if (maxlen >= 2 && s[1] >=0xa1 && s[1] <=0xfe)
            return(2);
    }                                    (10)
    else if (s[0] == 0x8f) {
        if(maxlen >=3 && (s[1] >=0xa1 && s[1] <=0xfe) &&
           (s[2] >=0xa1 && s[2] <= 0xfe))
            return(3);
    }                                    (11)
    else if (s[0] <= 0x9f)
        return(1);                       (12)
    else if (s[0] >= 0xa1) {
        if (maxlen >=2 && (s[0] <= 0xfe) )
            if ( (s[1] >=0xa1 && s[1] <= 0xfe) ||
                 (s[1] >=0x21 && s[1] <= 0x7e) )
                return(2);
    }                                    (13)
    _Seterrno(EILSEQ);
    return((size_t)-1);                  (14)
}
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid.
This value is passed to the method by the mblen() function.
This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension to values when comparing a small signed data type, such as char, to a larger signed data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in bits that remain when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e. In this case, a condition like the following one would be evaluated as true when you would expect it to be false:
if (s[0] <= 0x8d)
To set errno in a way that works correctly with multithreaded applications, use _Seterrno rather than an assignment statement.
If yes, returns 1 to indicate that the character length is 1 byte.
If yes, returns 2 to indicate that the character length is 2 bytes.
If yes, returns 3 to indicate that the character length is 3 bytes.
If yes, returns 1 to indicate that the character length is 1 byte.
If yes, returns 2 to indicate that the character length is 2 bytes.
These statements execute if the multibyte data in the standard I/O buffer satisfies none of the preceding if conditions.
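The sign-extension hazard that motivates the cast to unsigned char can be reproduced in isolation. The following is a minimal sketch and is not part of the method source; the two function names are hypothetical, and signed char is used explicitly because the signedness of plain char varies between compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: examines the lead byte through a signed char
   pointer. A byte such as 0x8e is promoted to a negative int, so the
   comparison below succeeds even though 0x8e is greater than 0x8d. */
int lead_is_single_byte_signed(const char *fs)
{
    const signed char *s = (const signed char *) fs;

    assert(fs != NULL);
    return s[0] <= 0x8d;
}

/* Hypothetical helper: the same comparison through an unsigned char
   pointer, which is the approach the method takes. The byte keeps its
   value in the range 0x00 to 0xff after promotion to int. */
int lead_is_single_byte_unsigned(const char *fs)
{
    const unsigned char *s = (const unsigned char *) fs;

    assert(fs != NULL);
    return s[0] <= 0x8d;
}
```

For a buffer whose first byte is 0x8e, the signed variant reports a single-byte character and the unsigned variant does not.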
#include <stdlib.h>                                        (1)
#include <wchar.h>
#include <sys/localedef.h>

size_t __mbstowcs_sdeckanji(
    wchar_t *pwcs,                                         (2)
    const char *s,                                         (3)
    size_t n,                                              (4)
    _LC_charmap_t *handle )                                (5)
{
    int len = n;                                           (6)
    int rc;                                                (7)
    int cnt;                                               (8)
    wchar_t *pwcs0 = pwcs;                                 (9)
    int mb_cur_max;                                        (10)

    if (s == NULL)
        return (0);                                        (11)

    mb_cur_max = MB_CUR_MAX;                               (12)

    if (pwcs == (wchar_t *)NULL) {
        cnt = 0;
        while (*s != '\0') {
            if ((rc = __mblen_sdeckanji(s, mb_cur_max, handle)) == -1)
                return(-1);
            cnt++ ;
            s += rc;
        }
        return(cnt);
    }                                                      (13)

    while (len-- > 0) {
        if ( *s == '\0') {
            *pwcs = (wchar_t) '\0';
            return (pwcs - pwcs0);
        }
        if ((cnt = __mbtowc_sdeckanji(pwcs, s, mb_cur_max, handle)) < 0)
            return(-1);
        s += cnt;
        ++pwcs;
    }                                                      (14)

    return (n);                                            (15)
}
The programmer can request the size of the pwcs buffer (for memory allocation purposes) by passing a null pointer as the pwcs parameter in the call to mbstowcs(). The programmer can then use the return value, which does not count the null terminator, to allocate memory for the application's wide-character buffer before calling mbstowcs() again to actually convert the multibyte string.
Stops processing and returns the number of wide characters in the pwcs buffer if a NULL is encountered; increments the byte position in the multibyte character buffer by an appropriate number each time a character is successfully converted.
This while loop uses the condition len-- > 0 to ensure that processing stops when the pwcs buffer is full. The first if condition in the loop makes sure that, if the multibyte string in the s buffer is null terminated, the associated null terminator in the pwcs buffer is not included in the wide-character count that the mbstowcs() function returns to the application.
This statement executes if the pwcs buffer runs out of space before a NULL is encountered in the s buffer.
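From the application side, the two-call pattern that the null pwcs branch supports might look like the following. This is a sketch that uses only the standard mbstowcs() interface; the helper name mbs_to_wcs_alloc is hypothetical, and the result depends on the locale that is in effect when the helper is called.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a multibyte string to a newly allocated
   wide-character string. Returns NULL on an invalid sequence or on
   allocation failure. The caller frees the result. */
wchar_t *mbs_to_wcs_alloc(const char *s)
{
    size_t n;
    wchar_t *w;

    assert(s != NULL);

    /* First call: null destination, so nothing is stored and the
       return value is the number of wide characters required,
       excluding the null terminator. */
    n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Second call: convert, leaving room for the terminator. */
    if (mbstowcs(w, s, n + 1) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}
```

Allocating n + 1 elements matters because the count returned by the first call excludes the terminating null wide character.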
#include <stdlib.h>                                        (1)
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  s[0] <= 0x9f:  PC = s[0]
  s[0] =  0x8e:  PC = s[1] + 0x5f;
  s[0] =  0x8f:  PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
  s[0] >= 0xa1:
      0xa1 <= s[1] <= 0xfe
                 PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
      0x21 <= s[1] <= 0x7e
                 PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                         (2)

int __mbtowc_sdeckanji(
    wchar_t *pwc,                                          (3)
    const char *ts,                                        (4)
    size_t maxlen,                                         (5)
    _LC_charmap_t *handle )                                (6)
{
    unsigned char *s = (unsigned char *)ts;                (7)
    wchar_t dummy;                                         (8)

    if (s == NULL)
        return(0);                                         (9)

    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    }                                                      (10)

    if (pwc == (wchar_t *)NULL)
        pwc = &dummy;                                      (11)

    if (s[0] <= 0x8d) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    }                                                      (12)
    else if (s[0] == 0x8e) {
        if ( (maxlen >= 2) && ((s[1] >=0xa1) && (s[1] <=0xfe))) {
            *pwc = (wchar_t) (s[1] + 0x5f);   /* 0x100 - 0xa1 */
            return(2);
        }
    }                                                      (13)
    else if (s[0] == 0x8f) {
        if((maxlen >= 3) && (((s[1] >=0xa1) && (s[1] <=0xfe)) &&
                             ((s[2] >=0xa1) && (s[2] <= 0xfe)))) {
            *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                   (wchar_t) (s[2] - 0xa1)) + 0x303c;
            return(3);
        }
    }                                                      (14)
    else if (s[0] <= 0x9f) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    }                                                      (15)
    else if (((s[0] >= 0xa1) && (s[0] <= 0xfe)) && (maxlen >= 2)){
        if (((s[1] >=0xa1) && (s[1] <= 0xfe))){
            *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                   (wchar_t)(s[1] - 0xa1)) + 0x15e;
            return(2);
        }
        else if (((s[1] >=0x21) && (s[1] <= 0x7e))){
            *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                   (wchar_t)(s[1] - 0x21)) + 0x5f1a;
            return(2);
        }
    }                                                      (16)

    _Seterrno(EILSEQ);
    return(-1);                                            (17)
}
The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid.
This value is passed from the calling function; the value will have been set to MB_CUR_MAX on the original call made by the application programmer.
This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension to values when comparing a small signed data type, such as char, to a larger signed data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in the additional bits when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e, which is a negative number. In this case, a condition like the following one would be evaluated as true when you would expect it to be false:
if (s[0] <= 0x8d)
If passed a null pointer, this method should return a value to indicate whether the locale's character encoding is stateful or stateless. Return a nonzero value if your locale's character encoding is stateful.
This operation ensures that pwc always points to a valid address; otherwise, an application could produce a segmentation fault by referring to this pointer when a wide character has not been stored in pwc.
If yes, stores the associated process code value in the pwc buffer and returns 1 to indicate that the character length is 1 byte.
If yes, stores the associated process code value in the pwc buffer and returns 2 to indicate that the character length is 2 bytes.
If yes, stores the associated process code value in the pwc buffer and returns 3 to indicate that the character length is 3 bytes.
If yes, stores the associated process code value in the pwc buffer and returns 1 to indicate that the character length is 1 byte.
If yes, stores the associated process code value in the pwc buffer and returns 2 to indicate that the character length is 2 bytes.
These statements execute if the multibyte data in the s buffer satisfies none of the preceding if conditions.
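The two null-pointer cases described for this method are visible to applications through the standard mbtowc() function. The sketch below illustrates them with hypothetical wrapper names; it assumes nothing about the sdeckanji locale and simply reports what the current locale's method returns.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: a null string pointer asks whether the
   encoding of the current locale is state dependent. A stateless
   encoding, such as the one in this chapter, yields 0. */
int encoding_is_stateful(void)
{
    return mbtowc(NULL, NULL, 0) != 0;
}

/* Hypothetical wrapper: a null pwc pointer asks only for the length
   of the first character. Returns the number of bytes, 0 if s points
   to the null terminator, or -1 if the bytes are not a valid character. */
int first_char_length(const char *s, size_t maxlen)
{
    assert(s != NULL);
    return mbtowc(NULL, s, maxlen);
}
```

Because the method substitutes a local dummy variable when pwc is null, the second wrapper is safe even though the caller supplies no storage for the wide character.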
#include <stdlib.h>                                        (1)
#include <wchar.h>
#include <limits.h>
#include <sys/localedef.h>

size_t __wcstombs_sdeckanji(
    char *s,                                               (2)
    const wchar_t *pwcs,                                   (3)
    size_t n,                                              (4)
    _LC_charmap_t *handle )                                (5)
{
    int cnt=0;                                             (6)
    int len=0;                                             (7)
    int i=0;                                               (8)
    char tmps[MB_LEN_MAX+1];                               (9)

    if ( s == (char *)NULL) {
        cnt = 0;
        while (*pwcs != (wchar_t)'\0') {
            if ((len = __wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
                return(-1);
            cnt += len;
            pwcs++;
        }
        return(cnt);
    }                                                      (10)

    if (*pwcs == (wchar_t)'\0') {
        *s = '\0';
        return(0);
    }                                                      (11)

    while (1) {                                            (12)
        if ((len = __wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
            return(-1);                                    (13)
        else if (cnt+len > n) {
            *s = '\0';
            break;
        }                                                  (14)
        if (tmps[0] == '\0') {
            *s = '\0';
            break;
        }                                                  (15)
        for (i=0; i<len; i++) {
            *s = tmps[i];
            s++;
        }                                                  (16)
        cnt += len;                                        (17)
        if (cnt == n)
            break;                                         (18)
        pwcs++;                                            (19)
    }                                                      (20)

    if (cnt == 0)
        cnt = len;                                         (21)
    return (cnt);                                          (22)
}
This value is supplied by the calling function.
If yes, calls the wctomb method to calculate the number of bytes required for converted characters (excluding the null terminator) in the multibyte-character buffer.
The programmer can request the size of the s buffer (for memory allocation purposes) by passing a null pointer as the s parameter on the call to wcstombs(). The programmer can then use the return value, which does not count the null terminator, to allocate memory for the application's multibyte-character buffer before calling wcstombs() again to actually convert the wide-character string.
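An application might use the null s branch as follows. This sketch relies only on the standard wcstombs() interface; the helper name wcs_to_mbs_alloc is hypothetical, and the bytes produced depend on the locale in effect.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a wide-character string to a newly
   allocated multibyte string. Returns NULL if a wide character has no
   multibyte form or if allocation fails. The caller frees the result. */
char *wcs_to_mbs_alloc(const wchar_t *wcs)
{
    size_t n;
    char *s;

    assert(wcs != NULL);

    /* First call: null destination, so the return value is the number
       of bytes required, excluding the null terminator. */
    n = wcstombs(NULL, wcs, 0);
    if (n == (size_t)-1)
        return NULL;

    s = malloc(n + 1);
    if (s == NULL)
        return NULL;

    /* Second call: convert, leaving room for the terminator. */
    if (wcstombs(s, wcs, n + 1) == (size_t)-1) {
        free(s);
        return NULL;
    }
    return s;
}
```

The byte count can exceed the number of wide characters; in the codeset described here, a single wide character needs up to 3 bytes.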
#include <stdlib.h>                                        (1)
#include <wchar.h>
#include <sys/errno.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  PC <= 0x009f:                 s[0] = PC
  PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                                s[1] = PC - 0x005f
  PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                                s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
  PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                                s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                                s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
  PC >= 0x5f1a and PC <=0x8df7: s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                                s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                         (2)

int __wctomb_sdeckanji(
    char *s,                                               (3)
    wchar_t wc,                                            (4)
    _LC_charmap_t *handle )                                (5)
{
    if (s == (char *)NULL)
        return(0);                                         (6)

    if (wc <= 0x9f) {
        s[0] = (char) wc;
        return(1);
    }                                                      (7)
    else if ((wc >= 0x0100) && (wc <= 0x015d)) {
        s[0] = 0x8e;
        s[1] = wc - 0x5f;
        return(2);
    }                                                      (8)
    else if ((wc >=0x015e) && (wc <= 0x303b)) {
        s[0] = (char) (((wc - 0x015e) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x015e) & 0x007f) + 0x00a1);
        return(2);
    }                                                      (9)
    else if ((wc >=0x303c) && (wc <= 0x5f19)) {
        s[0] = 0x8f;
        s[1] = (char) (((wc - 0x303c) >> 7) + 0x00a1);
        s[2] = (char) (((wc - 0x303c) & 0x007f) + 0x00a1);
        return(3);
    }                                                      (10)
    else if ((wc >=0x5f1a) && (wc <= 0x8df7)) {
        s[0] = (char) (((wc - 0x5f1a) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x5f1a) & 0x007f) + 0x0021);
        return(2);
    }                                                      (11)

    _Seterrno(EILSEQ);
    return(-1);                                            (12)
}
Each character set supported by the codeset corresponds to a unique range of wide-character (process code) values and, within each character set, multibyte characters are of uniform length (1, 2, or 3 bytes). Therefore, the range in which each wide-character value falls indicates the number of bytes required for the character in multibyte format; the wide-character value itself determines the specific byte value or values for the character in multibyte format.
These statements execute if the wide-character value satisfies none of the preceding conditions.
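The arithmetic that links multibyte form and process code can be checked in isolation. The following sketch covers only the JIS X0208 range and uses the offsets given in the method comments (0xa1 for each byte, 0x015e for the start of the process code range); the function and constant names are hypothetical.

```c
#include <assert.h>

enum {
    X0208_BASE = 0x015e,   /* first process code of the JIS X0208 range */
    BYTE_LOW   = 0xa1,     /* lowest valid value of either byte */
    BYTE_HIGH  = 0xfe      /* highest valid value of either byte */
};

/* Multibyte to process code: the first byte selects a row of 128
   slots, the second byte selects the slot within the row. */
unsigned long x0208_to_pc(unsigned char b0, unsigned char b1)
{
    assert(b0 >= BYTE_LOW && b0 <= BYTE_HIGH);
    assert(b1 >= BYTE_LOW && b1 <= BYTE_HIGH);
    return (((unsigned long)(b0 - BYTE_LOW) << 7) |
            (unsigned long)(b1 - BYTE_LOW)) + X0208_BASE;
}

/* Process code to multibyte: the inverse of the function above. */
void pc_to_x0208(unsigned long pc, unsigned char out[2])
{
    assert(pc >= X0208_BASE && pc <= 0x303b);
    out[0] = (unsigned char)(((pc - X0208_BASE) >> 7) + BYTE_LOW);
    out[1] = (unsigned char)(((pc - X0208_BASE) & 0x7f) + BYTE_LOW);
}
```

The byte pair 0xa1 0xa1 maps to 0x015e and the pair 0xfe 0xfe maps to 0x303b, which matches the range boundaries in the table. The other multibyte ranges follow the same pattern with different base values.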
#include <stdlib.h>                                        (1)
#include <wchar.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  PC <= 0x009f:                 s[0] = PC
  PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                                s[1] = PC - 0x005f
  PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                                s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
  PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                                s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                                s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
  PC >= 0x5f1a and PC <=0x8df7: s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                                s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                         (2)

int __wcswidth_sdeckanji(
    const wchar_t *wcs,                                    (3)
    size_t n,                                              (4)
    _LC_charmap_t *hdl )                                   (5)
{
    int len;                                               (6)
    int i;                                                 (7)

    if (wcs == (wchar_t *)NULL || *wcs == (wchar_t)NULL)
        return(0);                                         (8)

    len = 0;                                               (9)

    for (i=0; wcs[i] != (wchar_t)NULL && i<n; i++) {       (10)
        if (wcs[i] <= 0x9f)
            len += 1;                                      (11)
        else if ((wcs[i] >= 0x0100) && (wcs[i] <= 0x015d))
            len += 1;                                      (12)
        else if ((wcs[i] >=0x015e) && (wcs[i] <= 0x303b))
            len += 2;                                      (13)
        else if ((wcs[i] >=0x303c) && (wcs[i] <= 0x5f19))
            len += 2;                                      (14)
        else if ((wcs[i] >=0x5f1a) && (wcs[i] <= 0x8df7))
            len += 2;                                      (15)
        else
            return(-1);                                    (16)
    }                                                      (17)

    return(len);                                           (18)
}
Note that each character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns.
This statement executes if a value that satisfies none of the preceding conditions is encountered in the string. The calling function, wcswidth(), also returns -1 if the wide character is nonprintable; however, this condition is evaluated at the level of the calling function and does not need to be evaluated by the method.
#include <stdlib.h>                                        (1)
#include <wchar.h>
#include <sys/localedef.h>

/*
  The algorithm for this conversion is:
  PC <= 0x009f:                 s[0] = PC
  PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                                s[1] = PC - 0x005f
  PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                                s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
  PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                                s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                                s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
  PC >= 0x5f1a and PC <=0x8df7: s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                                s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

  +-----------------+-----------+-----------+-----------+
  |  process code   |   s[0]    |   s[1]    |   s[2]    |
  +-----------------+-----------+-----------+-----------+
  | 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
  | 0x00a0 - 0x00ff |    --     |    --     |    --     |
  | 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
  | 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
  | 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
  | 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
  +-----------------+-----------+-----------+-----------+
*/                                                         (2)

int __wcwidth_sdeckanji(
    wint_t wc,                                             (3)
    _LC_charmap_t *hdl )                                   (4)
{
    if (wc == 0)
        return(0);                                         (5)

    if (wc <= 0x9f)
        return(1);                                         (6)
    else if ((wc >= 0x0100) && (wc <= 0x015d))
        return(1);                                         (7)
    else if ((wc >=0x015e) && (wc <= 0x303b))
        return(2);                                         (8)
    else if ((wc >=0x303c) && (wc <= 0x5f19))
        return(2);                                         (9)
    else if ((wc >=0x5f1a) && (wc <= 0x8df7))
        return(2);                                         (10)

    return(-1);                                            (11)
}
Note that a character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns.
The calling function, wcwidth(), also returns -1 if the wide character is nonprintable; however, this condition is evaluated at the level of the calling function and does not need to be evaluated by the method.
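The difference between multibyte length and display width is easier to see when both are placed side by side. The table-driven lookup below is a sketch only: the structure and function names are hypothetical, and the ranges, byte lengths, and column counts are copied from the wctomb and wcwidth method sources in this chapter. The null wide character, which the wcwidth method treats as width 0, is not special-cased here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry describing one process code range. */
struct pc_range {
    unsigned long lo, hi;   /* inclusive process code range */
    int bytes;              /* length in multibyte format */
    int columns;            /* display width in columns */
};

static const struct pc_range ranges[] = {
    { 0x0000, 0x009f, 1, 1 },   /* single-byte characters */
    { 0x0100, 0x015d, 2, 1 },   /* JIS X0201 RH */
    { 0x015e, 0x303b, 2, 2 },   /* JIS X0208 */
    { 0x303c, 0x5f19, 3, 2 },   /* JIS X0212 */
    { 0x5f1a, 0x8df7, 2, 2 },   /* UDC */
};

/* Returns the entry that contains pc, or NULL if pc is unassigned. */
const struct pc_range *find_range(unsigned long pc)
{
    size_t i;

    for (i = 0; i < sizeof ranges / sizeof ranges[0]; i++)
        if (pc >= ranges[i].lo && pc <= ranges[i].hi)
            return &ranges[i];
    return NULL;
}

/* Sanity check: the ranges must be ascending and must not overlap. */
void check_ranges(void)
{
    size_t i;

    for (i = 1; i < sizeof ranges / sizeof ranges[0]; i++)
        assert(ranges[i - 1].hi < ranges[i].lo);
}
```

A JIS X0201 RH character occupies 2 bytes but 1 column, and a JIS X0212 character occupies 3 bytes but 2 columns, which is why the width methods cannot be derived from mblen.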
Writing optional methods requires detailed information about the internal interfaces to C library routines. This information is proprietary to Digital and may be subject to change. In the rare cases where your locale must include an optional method, contact your Digital technical support representative to request information.
Example 7-21 shows the compiler and linker command lines that are required to build the method source files into a shareable library that is used with the ja_JP.sdeckanji locale.
cc -std0 -c \
    __mblen_sdeckanji.c __mbstopcs_sdeckanji.c \
    __mbstowcs_sdeckanji.c __mbtopc_sdeckanji.c \
    __mbtowc_sdeckanji.c __pcstombs_sdeckanji.c \
    __pctomb_sdeckanji.c __wcstombs_sdeckanji.c \
    __wcswidth_sdeckanji.c __wctomb_sdeckanji.c \
    __wcwidth_sdeckanji.c

ld -shared -set_version osf.1 -soname libsdeckanji.so \
    -no_archive -o libsdeckanji.so \
    __mblen_sdeckanji.o __mbstopcs_sdeckanji.o \
    __mbstowcs_sdeckanji.o __mbtopc_sdeckanji.o \
    __mbtowc_sdeckanji.o __pcstombs_sdeckanji.o __pctomb_sdeckanji.o \
    __wcstombs_sdeckanji.o __wcswidth_sdeckanji.o __wctomb_sdeckanji.o \
    __wcwidth_sdeckanji.o \
    -lc
Refer to the cc(1) and ld(1) reference pages for more information about the cc and ld commands and how to build shared libraries.
Example 7-22 shows the section of a methods file for the methods used with the ja_JP.sdeckanji locale. Because there is a mandatory list of methods that you must define if you want to override any C library interfaces, your methods file must always specify an entry for each of the required methods as shown in this example. The ja_JP.sdeckanji locale relies on default implementations for all optional methods, so Example 7-22 does not contain entries for any of the optional methods.
# sdeckanji.m                                                  (1)
# <method_keyword> "<entry>" "<package>" "<library_path>"      (1)

METHODS                                                        (2)
__mbstopcs  "__mbstopcs_sdeckanji"  "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
__mbtopc    "__mbtopc_sdeckanji"    "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
__pcstombs  "__pcstombs_sdeckanji"  "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
__pctomb    "__pctomb_sdeckanji"    "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
mblen       "__mblen_sdeckanji"     "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
mbstowcs    "__mbstowcs_sdeckanji"  "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
mbtowc      "__mbtowc_sdeckanji"    "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
wcstombs    "__wcstombs_sdeckanji"  "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
wcswidth    "__wcswidth_sdeckanji"  "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
wctomb      "__wctomb_sdeckanji"    "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
wcwidth     "__wcwidth_sdeckanji"   "libsdeckanji.so" \
            "/usr/shlib/libsdeckanji.so"                       (3)
END METHODS                                                    (4)
These lines specify the name of the methods file and the format of method entries. Note that the field identified in the format as <package> is ignored, but you must specify some string for this field in order to specify a library path.
Refer to the localedef(1) reference page for detailed information about methods file entries.
% localedef -f ISO8859-1.cmap \                (1)
    -i de_DE.ISO8859-1.lscr \                  (2)
    de_DE.ISO8859-1@example                    (3)
When you are testing locales, particularly ones that are similar to standard locales installed on the system, you should add an extension to the locale name. Varying names with the at (@) extension allows you to specify the standard strings for language, territory, and codeset and still be sure that the test locale is uniquely identified. This is important if you later decide to move the locale to the directory /usr/lib/nls/loc where other locales reside.
Example 7-23 shows only one form and a few options for the localedef command. The localedef(1) reference page provides a complete description of the command. The following is a summary of some important rules and options:
By default, locales must reside in the /usr/lib/nls/loc directory to be found. If you want to test your locale before moving it to the /usr/lib/nls/loc directory, you can define the LOCPATH variable to specify the directory where your locale is located. You can then define the LANG environment variable to be your new locale and interactively test the locale with commands and applications.
Example 7-24 uses the date command to test the date/time format.
% setenv LOCPATH ~harry/locales
% setenv LANG de_DE.ISO8859-1@example
% date
12.Dezember 1993 09:18:11
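The same check can be made from a small program, which is useful when you want to test more than one format. The following is a sketch that uses only the standard setlocale() and strftime() functions; the helper name format_date_in_locale is hypothetical. Passing an empty string as the locale name makes setlocale() take the locale from the environment (LANG and the LC_* variables), so the program picks up a test locale in the same way the date command does.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: formats the time t according to the date and
   time conventions of the named locale. Returns 0 on success, or -1
   if the locale cannot be loaded or the buffer is too small. */
int format_date_in_locale(const char *locale_name, time_t t,
                          char *buf, size_t size)
{
    struct tm *tm;

    assert(buf != NULL && size > 0);

    /* Fails (returns NULL) when the locale cannot be found. */
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    tm = localtime(&t);
    if (tm == NULL)
        return -1;

    /* %c selects the locale's preferred date and time format. */
    return strftime(buf, size, "%c", tm) > 0 ? 0 : -1;
}
```

A failure return for your locale name usually means that the locale file is not in the default locale directory or in a directory that LOCPATH names.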
The LOCPATH variable is an extension to specifications in the X/Open UNIX standard and therefore may not be recognized on all systems that conform to this standard.
Note
Some programs have support files that are installed in system directories with names that exactly match the names of standard locales. In such cases, application software, system software, or both might use the value of the LANG environment variable to determine the locale-specific directory in which the support files reside. If assigned directly to the LANG or LC_ALL environment variable, locale file names with an at (@) suffix may result in invalid search paths for some applications. The following example shows how you can work around this problem by assigning the standard locale name to the LANG variable and the name of your variant locale to the locale category variables. You need to make assignments only to those category variables that represent areas where your locale differs from the locale on which it is based.
% setenv LANG de_DE.ISO8859-1
% setenv LC_CTYPE de_DE.ISO8859-1@example
% setenv LC_COLLATE de_DE.ISO8859-1@example
.
.
.
% setenv LC_TIME de_DE.ISO8859-1@example
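To confirm which locale is actually in effect for each category after assignments like these, a program can query the categories one at a time. The sketch below uses the standard setlocale() query form (a null locale argument); the helper name report_categories is hypothetical, and the list of categories is limited to those defined by the C standard.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: writes the locale currently in effect for each
   standard category to out. Returns the number of categories reported. */
int report_categories(FILE *out)
{
    static const struct { int id; const char *name; } cats[] = {
        { LC_CTYPE,    "LC_CTYPE"    },
        { LC_COLLATE,  "LC_COLLATE"  },
        { LC_TIME,     "LC_TIME"     },
        { LC_NUMERIC,  "LC_NUMERIC"  },
        { LC_MONETARY, "LC_MONETARY" },
    };
    size_t i;
    int reported = 0;

    assert(out != NULL);

    for (i = 0; i < sizeof cats / sizeof cats[0]; i++) {
        /* A null second argument queries without changing anything. */
        const char *current = setlocale(cats[i].id, NULL);

        if (current != NULL) {
            fprintf(out, "%s=%s\n", cats[i].name, current);
            reported++;
        }
    }
    return reported;
}
```

Call setlocale(LC_ALL, "") before calling the helper so that the program adopts the settings from the environment; otherwise every category reports the default C locale.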