Unicode(5)

NAME
  Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8, UTF-16, UTF-32,
  iso10646 - Support for the Unicode and ISO/IEC 10646 standards

DESCRIPTION
  The operating system provides locales and codeset converters that support
  the following standards:

    ·  The Unicode Standard, Version 3.0, Unicode, Inc., 2000

    ·  The Unicode Standard, Version 3.1, Unicode, Inc., 2001

    ·  Information Technology-Universal Multiple-Octet Coded Character Set,
       ISO/IEC 10646:2001

       The Basic Multilingual Plane defined by this standard is identical
       with the main body of Unicode character encoding.

  These standards define generalized character encoding rules that can be
  applied to characters in most native language scripts. The Unicode standard
  specifies a universal character set (UCS). Version 3.0 of the Unicode
  standard contains definitions for 49,194 characters and also includes a
  Private Use Area for vendor- or user-defined characters. Version 3.1 of the
  Unicode standard adds 44,946 new character definitions, incorporates UTF-32
  (32-bit encoding) into the standard, and adds three new planes beyond the
  16-bit codespace of Plane 0 (Basic Multilingual Plane). Plane 1
  (Supplementary Multilingual Plane) contains code positions U+10000 to
  U+1FFFF; Plane 2 (Supplementary Ideographic Plane) contains code positions
  U+20000 to U+2FFFF; Plane 14 (Supplementary Special-Purpose Plane) contains
  code positions U+E0000 to U+EFFFF.
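
  The plane boundaries listed above follow from simple arithmetic: each
  plane spans 0x10000 code positions, so the plane number is the code point
  shifted right by 16 bits. A minimal Python sketch (the helper name
  plane_of and the lookup table are illustrative, not part of the operating
  system):

```python
# Plane names for the planes that this reference page mentions.
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-Purpose Plane",
}


def plane_of(code_point):
    """Return the number of the plane that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    # Each plane holds 0x10000 (65,536) code positions.
    return code_point >> 16


for cp in (0x0041, 0x10000, 0x2FFFF, 0xE0001):
    plane = plane_of(cp)
    print(f"U+{cp:04X} is in plane {plane}: {PLANE_NAMES.get(plane, '(not named here)')}")
```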

  See the Unicode web site at http://www.unicode.org/ for more information on
  the Unicode standard. See the Unicode ReadMe document in
  /usr/share/unidata/, which describes the Unicode standard version currently
  supported on the operating system.

  The following list summarizes the main features of the Unicode character
  set:

    ·  Characters have properties, such as base, numeric, spacing,
       combination, and directionality. The Unicode standard provides rules
       for ordering characters with different properties so that parsing of
       character sequences is unambiguous.

    ·  The relationship between Unicode characters and the glyphs in the
       native language script that users see, type, or print is not
       necessarily one-to-one. A glyph may be mapped to a single abstract
       character or to a composed character. Conversely, more than one glyph
       can be mapped to a character.

    ·  Certain sequences of Unicode characters in a text stream are
       transformed into other characters, called composed characters.

    ·  The ISO 8859-1 character set occupies the first 256 code positions
       (and the ASCII character set the first 128 positions) of the UCS.
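
  Several of the features above can be observed with a short Python sketch.
  It uses Python's own unicodedata module, which reflects the Unicode
  version bundled with the Python release in use rather than the version
  described in /usr/share/unidata/. Using normalization form NFC to
  demonstrate composition is an assumption made for illustration; this
  reference page does not name a specific composition procedure.

```python
import unicodedata

# Character properties: a base letter and a combining mark differ in
# general category, combining class, and directionality.
for ch in ("e", "\u0301"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.bidirectional(ch),
    )

# A base letter followed by a combining accent can be transformed into a
# single composed character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), "code points ->", len(composed), "code point")

# ISO 8859-1 occupies the first 256 code positions: every ISO 8859-1 byte
# value equals the code position of the character it decodes to.
latin1_text = bytes(range(256)).decode("latin-1")
same_positions = [ord(ch) for ch in latin1_text] == list(range(256))
print("ISO 8859-1 maps onto U+0000..U+00FF:", same_positions)
```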

  The Unicode and ISO/IEC 10646 standards specify a universal repertoire of
  characters that can be used by all major languages and that allow character
  units to be processed for all languages under the same set of rules.
  Therefore, system support for the universal character set does not need to
  include multiple algorithms (one or more per language) for converting
  between file code and internal process code. However, the two different
  character sizes (16-bit or 32-bit) that the standards support require
  different parsing schemes for data input and output. Universal character
  encoding that an implementation parses in 16-bit units (2 octets) is known
  as UCS-2. Universal character encoding that an implementation parses in
  32-bit units (4 octets) is known as UCS-4. UCS-4 is the canonical ISO/IEC
  10646 encoding and is in use on systems that can support the larger data
  unit size.

  Because UCS-2 is a subset of UTF-16, the operating system supports UCS-2
  with UTF-16 codeset converters. The operating system supports UCS-4 with
  both codeset converters and locales. (Keep in mind that UCS-2 cannot be
  used to encode characters outside of the Basic Multilingual Plane.)
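
  As a rough illustration of the UCS-2 restriction, the following Python
  sketch checks whether text stays within the Basic Multilingual Plane and
  compares the storage needed in 16-bit and 32-bit units. Python has no
  separate UCS-2 codec, so the sketch uses its utf-16-be codec, which
  produces the same bytes as UCS-2 for Basic Multilingual Plane characters.
  The helper names are hypothetical.

```python
def fits_in_ucs2(text):
    """True if every character lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def unit_sizes(ch):
    """Return (bytes in UTF-16, bytes in UTF-32/UCS-4) for one character."""
    return len(ch.encode("utf-16-be")), len(ch.encode("utf-32-be"))


bmp_char = "A"                  # U+0041, Plane 0
supplementary = "\U00010400"    # a Plane 1 character

for ch in (bmp_char, supplementary):
    print(f"U+{ord(ch):04X}", "fits in UCS-2:", fits_in_ucs2(ch),
          "sizes:", unit_sizes(ch))
```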

  In terms of locales, the operating system supports both Unicode and dense
  code. The two types of locales differ in their manner of wide character
  encoding support. See l10n_intro(5) for information comparing the two
  locale types and for information on switching between Unicode and dense
  code locales.

  The Unicode and ISO/IEC 10646 standards define a number of transformation
  formats for the universal character set (UTF-8 and UTF-32 are the preferred
  transformation formats for the operating system):

    ·  UTF-8, the standard method for transforming UCS-4 process encoding
       into a sequence of 8-bit bytes and ensuring interchange transparency
       for characters in C0 code positions (0 to 31), the SPACE (32)
       character, and the DEL (127) character

       The operating system supports UTF-8 with both codeset converters and
       locales.

    ·  UTF-7, an obsolete interchange format for environments that strip the
       eighth bit from each byte

       The operating system does not support UTF-7.

    ·  UTF-1, an obsolete interchange format that is similar to UTF-8 but
       also ensures interchange transparency of characters in C1 code
       positions (128 to 159)

       The operating system does not support UTF-1.

    ·  UTF-16, which uses the surrogate character extension technique defined
       by Version 2.0 and later of the Unicode standard and represents
       characters in 16-bit units

       UTF-16 is a superset of UCS-2. As with UCS-2, UTF-16 encodes
       characters in the range U+0000 to U+FFFF as single 16-bit units. For
       characters in the range U+10000 to U+10FFFF, UTF-16 transforms them
       into a surrogate pair. The result of this transformation is that the
       high surrogate (the first of the pair) is in the range U+D800 to
       U+DBFF, while the low surrogate (the second part of the pair) is in
       the range U+DC00 to U+DFFF. These two 16-bit values represent a single
       character.

       Although UTF-16 does not support representation of the entire UCS-4
       code space (including private-use ranges for character values above
       U+10FFFF), it does support all characters that are currently
       defined for the languages covered by both standards.

       Byte orientation in file code can differ and, depending on the
       platform on which the file was generated, can be little-endian (LE) or
       big-endian (BE). UTF-16 uses a byte order mark (BOM), which is not
       part of the file text data, to indicate byte orientation. The code
       point of the BOM is U+FEFF. The Unicode standard also defines UTF-16LE
       and UTF-16BE, which are specific to the little-endian and big-endian
       orientations, respectively, and do not include a byte order mark.

       The operating system supports UTF-16, UTF-16LE, and UTF-16BE through
       codeset converters. The codeset converter name UCS-2 is recognized as
       an alias for UTF-16*, but with a restricted repertoire of characters.

					Note

	 By default, the operating system uses UTF-16 rather than UTF-16LE or
	 UTF-16BE.

	 In an input file, the software first looks for a BOM. If a BOM is
	 not found, the converter assumes UTF-16BE. This means that you must
	 explicitly specify UTF-16LE to the converter (convert files
	 manually) when UTF-16LE applies to an input file.

	 For an output file, the converter automatically inserts a BOM. This
	 means that you must explicitly specify UTF-16LE or UTF-16BE (convert
	 files manually) when you want conversion output to be UTF-16LE or
	 UTF-16BE rather than UTF-16.

    ·  UTF-32, which represents characters in 4-byte encoding units

       UTF-32 is a restricted subset of UCS-4. UTF-32 is restricted in values
       to the range U+0000 to U+10FFFF, which precisely matches the range of
       character values defined by UTF-16. Like UTF-16, UTF-32 does not
       support private-use ranges for character values above U+10FFFF.

       UTF-32 uses a BOM to indicate little-endian or big-endian byte
       orientation.  The Unicode standard also defines UTF-32LE and UTF-32BE,
       which are specific to the little-endian and big-endian orientations,
       respectively, and do not include a BOM. As with UTF-16, big-endian is
       the default byte order when a BOM is not generated.

       UTF-32 is almost the same as UCS-4, so you can use UCS-4 codeset
       converters to process UTF-32. UCS-4 converter software includes
       support for UTF-32, UTF-32LE, and UTF-32BE.
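
  Two of the UTF-16 rules described in the list above lend themselves to a
  short sketch: the arithmetic that turns a supplementary character into a
  surrogate pair, and the input rule from the Note (use the BOM if one is
  present, otherwise assume big-endian). The Python below is only an
  illustration of those rules under that reading; the function names are
  hypothetical and this is not the operating system's converter code.

```python
import codecs


def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits: U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # low 10 bits: U+DC00..U+DFFF
    return high, low


def decode_utf16_input(data):
    """Decode UTF-16 input: honor a BOM, otherwise assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM found: fall back to big-endian, as the Note describes.
    return data.decode("utf-16-be")


high, low = to_surrogate_pair(0x10000)
print(f"U+10000 -> {high:04X} {low:04X}")
print(decode_utf16_input(b"\xff\xfeA\x00"))   # little-endian with BOM
print(decode_utf16_input(b"\x00A"))           # no BOM, read as big-endian
```

  Little-endian data without a BOM would be misread by decode_utf16_input,
  which mirrors the reason the Note says UTF-16LE must be specified
  explicitly in that case.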

  Codeset Conversion

  Codeset converters are available to convert data in all the major encoding
  formats that the operating system supports to and from UCS-2, UTF-16,
  UCS-4, and UTF-8. If the worldwide support subsets are installed on your
  system, you can enter the following commands to find the names of these
  converters:

       % cd /usr/lib/nls/loc/iconv
       % ls | grep UTF
       % ls | grep UCS

  Among the converters listed, you will find some that handle conversion of
  data in the code-page format used on PC systems. See code_page(5) for more
  information about converting between codeset and code-page formats. You can
  use all codeset converters with the iconv command and associated library
  functions.
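
  To show what a codeset conversion does to the data itself, the following
  Python sketch converts ISO 8859-1 bytes to UTF-8 and to UTF-32. It uses
  Python's built-in codecs as a stand-in; the codec names are Python's own
  and are not the converter names found in /usr/lib/nls/loc/iconv. The
  convert helper is hypothetical.

```python
def convert(data, from_codeset, to_codeset):
    """Decode bytes from one codeset and re-encode them in another."""
    return data.decode(from_codeset).encode(to_codeset)


latin1_data = b"caf\xe9"    # "cafe" with e-acute (0xE9) in ISO 8859-1

utf8_data = convert(latin1_data, "iso8859-1", "utf-8")
utf32_data = convert(latin1_data, "iso8859-1", "utf-32-be")

print(latin1_data.hex(), "->", utf8_data.hex())
print(len(latin1_data), "bytes ->", len(utf32_data), "bytes in UTF-32BE")

# Converting back recovers the original bytes.
round_trip = convert(utf8_data, "utf-8", "iso8859-1")
print("round trip intact:", round_trip == latin1_data)
```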

				     Note

       The mapping of Korean Hangul characters changed between Version 1.1
       and Version 2.0 of the Unicode standard. By default, UTF-16, UCS-4,
       and UTF-8 conversion assumes Version 2.0 character mapping for Hangul
       characters. Therefore, if data is in Version 1.1 format, you must
       first convert the data to Version 2.0 format before converting from
       UTF-16, UCS-4, or UTF-8 to an entirely different format.

       The format of a codeset converter name is from-codeset_to-codeset. In
       converter names, the Version 1.1 codeset formats for UCS-2, UCS-4, and
       UTF-8 are represented by UNICODE-1-1, UNICODE-1-1-UCS-4, and
       UNICODE-1-1-UTF-8, respectively. The Version 2.0 codeset names are
       represented by UTF-16, UCS-4, and UTF-8.

       For example, if Korean data is currently in UCS-4 Version 1.1 format,
       the data must first be processed by the UNICODE-1-1-UCS-4_UCS-4
       converter before being processed by the UCS-4_deckorean converter.
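
  The naming rule and the two-step Korean example in the preceding Note can
  be expressed as a small Python sketch. The helpers are hypothetical; they
  only build converter names from the from-codeset_to-codeset pattern and
  do not check whether a given converter is installed.

```python
def converter_name(from_codeset, to_codeset):
    """Build a converter name in the from-codeset_to-codeset format."""
    return f"{from_codeset}_{to_codeset}"


def hangul_conversion_chain(v11_codeset, v20_codeset, target_codeset):
    """List the converters needed to move Version 1.1 data to a target."""
    return [
        converter_name(v11_codeset, v20_codeset),     # 1.1 -> 2.0 mapping
        converter_name(v20_codeset, target_codeset),  # 2.0 -> target
    ]


chain = hangul_conversion_chain("UNICODE-1-1-UCS-4", "UCS-4", "deckorean")
for step, name in enumerate(chain, start=1):
    print(f"step {step}: {name}")
```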

  See iconv_intro(5) for general information on codeset conversion.

  Locales

  The following locales use UTF-32 as internal processing code:

    ·  universal.UTF-8

       This locale is used by applications. It converts data in UTF-8 file
       format to UCS-4 process code and can be used to test any UCS-4
       character to determine if it is included in one of the following
       classes defined for the LC_CTYPE category: alnum, alpha, blank, cntrl,
       digit, graph, lower, print, punct, space, upper, or xdigit.

       In the universal.UTF-8 locale, the LC_MESSAGES, LC_MONETARY,
       LC_NUMERIC, and LC_TIME category definitions match those for the POSIX
       (C) locale.

    ·  language_territory.UTF-8

       These locales limit classification information to the characters in a
       particular native language, make country-specific data available to
       the application, and assume file data follows UTF-8 encoding rules.
       The operating system locales that support the euro monetary symbol use
       either the UTF-8 or ISO8859-15 codeset. See euro(5) for more
       information.

					Note

	 The X locale database file used by applications running in the
	 universal.UTF-8, en_US.UTF-8, or Asian locales (Chinese, Japanese,
	 or Korean) contains font definitions that include all the fonts used
	 with the operating system. This enables applications under
	 en_US.UTF-8 to display all the font characters installed with
	 Worldwide Language Software (WLS). Applications under the Asian
	 locales display all the font characters installed with WLS, except
	 for ISO8859-2, -4, -5, -7, -8, -9, -15, and TACTIS.

    ·  native_locale_name

       These locales are installed in the default Unicode path,
       /usr/i18n/lib/nls/ucsloc/ and use UTF-32 as internal processing code.
       However, they differ in the following ways:

	 -- The file code is specified by the codeset portion (for example,
	    ISO8859-1) of native_locale_name.

	 -- Classification information is not provided for the full set of
	    UTF-32 characters, but only for those in a particular native
	    language (for example, French).

	 -- Country-specific data is also available to the application. The
	    LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME
	    category definitions match those defined in native_locale_name.

    ·  native_locale_name@ucs4

       These locales are installed in /usr/i18n/lib/nls/loc/ and are the same
       as the native_locale_name locales installed in
       /usr/i18n/lib/nls/ucsloc/ except that they are not a complete set of
       locales and will not be enhanced in future versions of the operating
       system. They are provided for compatibility with existing
       applications. You cannot select @ucs4 locales from the CDE login menu;
       you must specify the locale name in the LANG environment variable.
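
  The relationship between UTF-8 file code and UCS-4 process code described
  for these locales can be sketched in Python. The class tests below only
  approximate a few of the LC_CTYPE classes by using Python's built-in
  string methods, which consult Python's own Unicode tables rather than any
  installed locale; the function names are hypothetical.

```python
def utf8_to_ucs4(data):
    """Decode UTF-8 file data into a list of UCS-4 code point values."""
    return [ord(ch) for ch in data.decode("utf-8")]


def approximate_classes(code_point):
    """Return a rough subset of LC_CTYPE-style classes for a code point."""
    ch = chr(code_point)
    checks = {
        "alpha": ch.isalpha(),
        "digit": ch.isdecimal(),
        "lower": ch.islower(),
        "upper": ch.isupper(),
        "space": ch.isspace(),
    }
    return {name for name, hit in checks.items() if hit}


file_data = b"A \xc3\xa9 \xe2\x82\xac"   # "A", e-acute, euro sign in UTF-8
for cp in utf8_to_ucs4(file_data):
    print(f"U+{cp:04X}", sorted(approximate_classes(cp)))
```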

  CDE desktop users can select .UTF-8 locales by choosing names followed by
  (Unicode) from the CDE language menu at session startup. In this case, the
  locale setting applies by default to all applications run during the CDE
  session.

  Unicode Character Database

  For the convenience of programmers, the source file for the Unicode
  character database is available online. This source file is the one used
  to build the .UTF-8 locales provided in optional software subsets included
  with the operating system product. When the .UTF-8 locales are installed on
  your system, both the Unicode character database and an associated ReadMe
  file are also installed in the /usr/share/unidata directory. The ReadMe
  file discusses the character properties supported by Unicode.
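
  This reference page does not describe the layout of the installed
  database file. Assuming it follows the semicolon-delimited layout of the
  UnicodeData.txt file published by the Unicode Consortium, one record
  could be read as in the Python sketch below. The sample line and the
  parse_record helper are for illustration only; consult the ReadMe file
  for the actual format.

```python
# One record in the UnicodeData.txt layout (an assumption about the
# installed file; see the ReadMe in /usr/share/unidata).
SAMPLE_RECORD = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"


def parse_record(line):
    """Extract a few character properties from one database record."""
    fields = line.rstrip("\n").split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "general_category": fields[2],
        "combining_class": int(fields[3]),
        "bidi_class": fields[4],
        # Field 13 holds the simple lowercase mapping, if any.
        "lowercase": int(fields[13], 16) if fields[13] else None,
    }


record = parse_record(SAMPLE_RECORD)
for key, value in record.items():
    print(f"{key}: {value}")
```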

  Font Support

  The operating system provides the following types of bitmap fonts for UCS
  characters:

    ·  Public domain Unicode fonts:

	    -etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
	    -etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
	    -etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1

    ·  Composite fonts that the libfr_FGC font renderer creates by combining
       fonts available for other codesets

    ·  Two sets of monospaced fonts (a 16x18 pixel set and a 24x24 pixel set)
       for UTF-8 locales with the following CDE font aliases (where -n is -1,
       -2, -3, -4, -5, -7, -8, -9, or -15):

	    -dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso8859-n@mono
	    -dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso10646-1@mono

  These fonts currently cover only a subset of the characters in UCS.  Each
  of the ETL public domain fonts supports about 1000 characters, but does not
  include any characters for Chinese, Japanese, or Korean. The composite
  fonts created by the font renderer are generated only from fonts available
  for the ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9) codesets.

  See iso8859-1(5) and iso8859-15(5) for the names of fonts available for
  Latin-1 and Latin-9 characters. The Latin-9 fonts, which include glyphs for
  the euro character, provide the best support for the
  language_territory.UTF-8 locales, which also support this character.

  See i18n_printing(5) and wwpsof(8) for information on printer support and
  converting bitmap font encoding to PostScript.

SEE ALSO
  Commands: locale(1), wwpsof(8)

  Others: ascii(5), code_page(5), iso8859-1(5), iso8859-15(5), i18n_intro(5),
  i18n_printing(5), iconv_intro(5), l10n_intro(5)

  Using International Software