Unicode(5)

NAME
  Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8, UTF-16, UTF-32,
  iso10646 - Support for the Unicode and ISO/IEC 10646 standards

DESCRIPTION
  The operating system provides locales and codeset converters that support
  the following standards:

    ·  The Unicode Standard, Version 3.0, Unicode, Inc., 1999

    ·  Information Technology-Universal Multiple-Octet Coded Character Set,
       ISO/IEC 10646:1993

       The Basic Multilingual Plane defined by this standard is identical
       to the main body of the Unicode character encoding.

  These standards define generalized character encoding rules that can be
  applied to characters in most native language scripts. The Unicode Standard
  specifies a universal character set (UCS) that contains definitions in
  Version 3.0 for 49,194 characters and also includes a Private Use Area for
  vendor- or user-defined characters. The following list summarizes the main
  features of this character set:

    ·  All characters are treated as 16-bit units.

    ·  Each 16-bit unit has an abstract character identity.

    ·  Certain sequences of 16-bit characters in a text stream are
       transformed into other characters, called composed characters.

    ·  Characters have properties, such as base, numeric, spacing,
       combination, and directionality. The Unicode standard provides rules
       for ordering characters with different properties so that parsing of
       character sequences is unambiguous.

    ·  The relationship between Unicode characters and the glyphs in the
       native language script that users see, type, or print is not
       necessarily one-to-one. A glyph may be mapped to a single abstract
       character or a composed character. Conversely, more than one glyph can
       be mapped to a character.

    ·  The ISO 8859-1 character set occupies the first 256 code positions
       (and the ASCII character set the first 128 positions) of the UCS.
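
  Two of the features listed above, composed characters and the placement
  of ISO 8859-1 at the start of the UCS, can be checked with a short sketch.
  It assumes Python and its standard unicodedata module, neither of which is
  part of the operating system's own Unicode support. The module's character
  data is newer than Version 3.0, but the mappings used here are unchanged.

```python
import unicodedata

# A base letter followed by a combining mark: two characters in the stream.
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# NFC normalization composes the sequence into one character, U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# ISO 8859-1 occupies the first 256 code positions of the UCS, so every
# Latin-1 byte value equals the code position of the character it decodes to.
latin1_bytes = bytes(range(256))
latin1_positions = [ord(ch) for ch in latin1_bytes.decode("iso8859-1")]

if __name__ == "__main__":
    print(len(decomposed), len(composed), hex(ord(composed)))
    print(latin1_positions == list(range(256)))
```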

  The ISO/IEC 10646 standard specifies both 16- and 32-bit units for each
  abstract character defined in the UCS.  The 16-bit character values in
  Unicode are zero-extended through a second 16-bit unit in the larger
  encoding format. The second, or low-surrogate, 16-bit unit is reserved for
  future use in both standards.

  The Unicode and ISO/IEC 10646 standards specify a uniform character size
  and allow character units to be processed for all languages by using the
  same set of rules. Therefore, system support for the universal character
  set does not need to include multiple algorithms (one or more per language)
  for converting between file code and internal process code. However, the
  two different character sizes (16-bit or 32-bit) that the standards support
  require different parsing schemes for data input and output. Universal
  character encoding that an implementation parses in 16-bit units (2 octets)
  is known as UCS-2.  This is the canonical Unicode encoding in wide use on
  PC systems. Universal character encoding that an implementation parses in
  32-bit units (4 octets) is known as UCS-4. This is the canonical ISO/IEC
  10646 encoding that is in use on systems that can support the larger data
  unit size.
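
  The difference between the two unit sizes can be illustrated with a
  minimal Python sketch. The function names are hypothetical, and big-endian
  byte order is an assumption made only for the example. Each UCS-4 unit is
  the corresponding UCS-2 unit zero-extended to 32 bits.

```python
import struct


def to_ucs2_be(text):
    """Pack each character as one big-endian 16-bit unit (2 octets)."""
    units = [ord(ch) for ch in text]
    if any(u > 0xFFFF for u in units):
        raise ValueError("character does not fit in a 16-bit unit")
    return struct.pack(">%dH" % len(units), *units)


def to_ucs4_be(text):
    """Pack each character as one big-endian 32-bit unit (4 octets)."""
    units = [ord(ch) for ch in text]
    return struct.pack(">%dI" % len(units), *units)


if __name__ == "__main__":
    sample = "A\u00e9"
    print(to_ucs2_be(sample).hex())
    print(to_ucs4_be(sample).hex())
```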

  The operating system supports UCS-2 with codeset converters and UCS-4 with
  both codeset converters and locales. The locales whose names include the
  string @ucs4 allow use of UCS-4 for internal process code with proprietary
  file encoding formats.

  The standards define a number of transformation formats for the universal
  character set.  For the most part, the following UCS transformation formats
  (UTFs) exist to transform UCS values into sequences of bytes for handling
  by various byte-oriented protocols:

    ·  UTF-8, the standard method for transforming UCS-4 process encoding
       into a sequence of 8-bit bytes and ensuring interchange transparency
       for characters in C0 code positions (0 to 31), the SPACE (32)
       character, and the DEL (127) character

       The operating system supports UTF-8 with both codeset converters and
       locales.

    ·  UTF-7, an obsolete interchange format for environments that strip the
       eighth bit from each byte

       The operating system does not support UTF-7.

    ·  UTF-1, an obsolete interchange format that is similar to UTF-8 but
       also ensures interchange transparency of characters in C1 code
       positions (128 to 159)

       The operating system does not support UTF-1.

    ·  UTF-16, which handles the surrogate character extensions defined by
       Version 2.0 of the Unicode Standard and represents characters in
       2-byte units

       The surrogate character extensions are characters whose values in
       UCS-4 are outside the range normally allowed by a 16-bit length
       restriction.  When data includes these characters, the UTF-16
       transformation format enables data exchange between applications using
       UCS-4 and applications that require the data to be in UCS-2 (2-byte)
       format. Although UTF-16 cannot represent the entire UCS-4 code
       space, it supports all characters (except those in certain
       private-use ranges) that are currently defined for the languages
       covered by both standards.

       Byte orientation in file code can differ and, depending on the
       platform on which the file was generated, can be little-endian (LE) or
       big-endian (BE).  UTF-16 uses a byte order mark (BOM), which is not
       part of the file text data, to indicate byte orientation. The code
       point of the BOM is U+FEFF. The Unicode Standard also defines UTF-16LE
       and UTF-16BE, which are specific to the little-endian and big-endian
       orientations, respectively, and do not include a byte order mark.

       The operating system supports UTF-16, UTF-16LE, and UTF-16BE through
       codeset converters. In terms of codeset converter names, UTF-16* is
       recognized as an alias for UCS-2 but also enables codeset conversion
       of surrogate character extensions.

					Note

	 By default, the operating system uses UTF-16 rather than UTF-16LE or
	 UTF-16BE. That is, in an input file, the software first looks for a
	 BOM. If a BOM is not found, the converter assumes UTF-16LE. This
	 means that you must explicitly specify UTF-16BE to the converter
	 (convert files manually) when UTF-16BE applies to an input file. For
	 an output file, the converter automatically inserts a BOM. This
	 means that you must explicitly specify UTF-16LE or UTF-16BE (convert
	 files manually) when you want conversion output to be UTF-16LE or
	 UTF-16BE rather than UTF-16.

    ·  UTF-32, which also supports the surrogate character extensions defined
       by the Unicode Standard but allows character representation in 4-byte
       encoding units

       In addition, UTF-32 is restricted in values to the range 0 to 10FFFF,
       which precisely matches the range of character values defined in the
       Unicode Standard. Unlike UTF-16, UTF-32 does not support private-use
       ranges for character values and therefore promotes interoperability
       among Unicode encoding formats.

       UTF-32 uses a byte order mark to indicate little-endian or big-endian
       byte orientation. The Unicode Standard also defines UTF-32LE and
       UTF-32BE, which are specific to the little-endian and big-endian
       orientations, respectively, and do not include a byte order mark.

       UTF-32 is almost the same as UCS-4, so you can use UCS-4 codeset
       converters to process UTF-32. However, the UCS-4 converter software
       has not yet been changed to support UTF-32, UTF-32LE, or UTF-32BE as
       alias names in the way that the UTF-16* strings are supported by the
       UCS-2 converters.
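
  The transformation formats in the list above can be sketched in a few
  lines of Python. This is an illustration of the encoding rules only, not
  the converter implementation. The function names are hypothetical, and the
  sketch assumes input limited to the range 0 to 10FFFF. The byte order
  fallback mirrors the little-endian default described in the UTF-16 note.

```python
def utf8_encode(cp):
    """Transform one UCS value into its UTF-8 byte sequence."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        # C0 controls, SPACE, DEL, and the rest of ASCII pass through as-is.
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp):
    """Return the 16-bit units for one UCS value, using a surrogate pair
    for values that do not fit in a single 16-bit unit."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def utf16_byte_order(data, default="LE"):
    """Look for a BOM (U+FEFF) at the start of the data; fall back to the
    default orientation when none is present."""
    if data[:2] == b"\xff\xfe":
        return "LE"
    if data[:2] == b"\xfe\xff":
        return "BE"
    return default
```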

  Codeset Conversion

  Codeset converters are available to convert data in all the major encoding
  formats that the operating system supports to and from UCS-2, UCS-4, and
  UTF-8.  If the worldwide support subsets are installed on your system, you
  can enter the following commands to find the names of these converters:

       % cd /usr/lib/nls/loc/iconv
       % ls | grep UTF
       % ls | grep UCS

  Among the converters listed, you will find some that handle conversion of
  data in the code-page format used on PC systems. See the code_page(5)
  reference page for more information about converting between codeset and
  code-page formats.  All codeset converters can be used with the iconv
  command and associated library functions.

				     Note

       There was a change in mapping of Korean Hangul characters between
       Version 1.1 and Version 2.0 of the Unicode Standard. By default,
       UCS-2, UCS-4, and UTF-8 conversion assumes Version 2.0 character
       mapping for Hangul characters.  Therefore, if data is in Version 1.1
       format, the data must first be converted to Version 2.0 format before
       converting from UCS-2, UCS-4, or UTF-8 to an entirely different
       format.

       The format of a codeset converter name is from-codeset_to-codeset.
       In converter names, the Version 1.1 codeset formats for UCS-2, UCS-4,
       and UTF-8 are represented by UNICODE-1-1, UNICODE-1-1-UCS-4, and
       UNICODE-1-1-UTF-8, respectively. The Version 2.0 codeset names are
       represented by UCS-2, UCS-4, and UTF-8. For example, if Korean data
       is currently in UCS-4 Version 1.1 format, the data must first be
       processed by the UNICODE-1-1-UCS-4_UCS-4 converter before being
       processed by the UCS-4_deckorean converter.

  See the iconv_intro(5) reference page for general information on codeset
  conversion.
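
  The converter naming rule and the two-step Hangul conversion described in
  the note can be sketched as follows. The helper names are hypothetical and
  exist only for illustration. The sketch builds converter names and does
  not check that a converter is actually installed on the system.

```python
def converter_name(from_codeset, to_codeset):
    """Build a converter name in the from-codeset_to-codeset format."""
    return "%s_%s" % (from_codeset, to_codeset)


def conversion_chain(codesets):
    """Return the converter names needed to step through the given
    codesets in order, one converter per adjacent pair."""
    return [converter_name(a, b) for a, b in zip(codesets, codesets[1:])]


if __name__ == "__main__":
    # Korean data in UCS-4 Version 1.1 format, converted by way of
    # Version 2.0 UCS-4 to the deckorean codeset.
    for name in conversion_chain(["UNICODE-1-1-UCS-4", "UCS-4", "deckorean"]):
        print(name)
```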

  Locales

  The following locales use UCS-4 as internal processing code:

    ·  universal.UTF-8

       This locale converts data in UTF-8 file format to UCS-4 process code.
       The locale can be used to test any UCS-4 character to determine if it
       is included in one of the following classes defined for the LC_CTYPE
       category: alnum, alpha, blank, cntrl, digit, graph, lower, print,
       punct, space, upper, or xdigit.

       In the universal.utf8@ucs4 locale, the LC_MESSAGES, LC_MONETARY,
       LC_NUMERIC, and LC_TIME category definitions match those for the POSIX
       (C) locale.

    ·  native_locale_name@ucs4

       These locales (for example, fr_FR.ISO8859-1@ucs4) perform the same
       function as the universal.UTF-8 locale but are different in the
       following ways:

	 --
	 The file code is specified by the codeset portion (for example,
	 ISO8859-1) of native_locale_name.

	 --
	 Classification information is not provided for the full set of UCS-4
	 characters, but only for those in a particular native language (for
	 example, French).

	 --
	 Country-specific data is also available to the application.  The
	 LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME
	 category definitions match those defined in native_locale_name.

    ·  language_territory.UTF-8

       These locales (for example, fr_FR.UTF-8) are similar to the @ucs4
       locales in limiting classification information to the characters in a
       particular native language and making country-specific data available
       to the application. However, the .UTF-8 locales assume file data
       follows UTF-8 encoding rules and are the only locales that support the
       euro monetary character (€).

					Note

	 The X locale database file used by applications running in the
	 universal.UTF-8, en_US.UTF-8, or Asian locales (Chinese, Japanese,
	 Korean) contains font definitions that include all the various fonts
	 used with the operating system. This enables applications under
	 en_US.UTF-8 to display all the font characters installed with
	 Worldwide Language Support (WLS). Applications under the Asian
	 locales display all the font characters installed with WLS, except
	 for ISO8859-2, -4, -5, -7, -8, -9, and TACTIS.

  CDE desktop users can select .UTF-8 locales by choosing names followed by
  (Unicode) from the CDE language menu at session startup. In this case, the
  locale setting applies by default to all applications run during the CDE
  session.

  Unicode Character Database

  For the convenience of programmers, the source file for the Unicode
  character database (Version 3.0.0) is available online. This source file is
  the one used to build the .UTF-8 locales provided in optional software
  subsets included with the operating system product. If the .UTF-8 locales
  are installed on your system, both the Unicode character database and an
  associated ReadMe file are also installed in the /usr/share/unidata
  directory.  The ReadMe file discusses the character properties supported by
  Unicode.
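
  A program that reads the character database might parse it as in the
  following Python sketch. The sketch assumes that the source file uses the
  semicolon-separated, 15-field record layout of the Unicode Consortium's
  UnicodeData.txt. That layout and the field names chosen here are
  assumptions; the ReadMe file in /usr/share/unidata is the authority for
  the installed file.

```python
from collections import namedtuple

# Field names are illustrative labels for the 15 positions in a record.
FIELDS = [
    "code", "name", "category", "combining", "bidi", "decomposition",
    "decimal", "digit", "numeric", "mirrored", "old_name", "comment",
    "upper", "lower", "title",
]
UcdRecord = namedtuple("UcdRecord", FIELDS)


def parse_ucd_line(line):
    """Split one database record into named fields."""
    parts = line.rstrip("\n").split(";")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(FIELDS), len(parts)))
    return UcdRecord(*parts)


# The record for U+0041, used here as sample input.
SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

if __name__ == "__main__":
    rec = parse_ucd_line(SAMPLE)
    print(rec.code, rec.name, rec.category, rec.lower)
```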

  Font Support

  The operating system provides the following types of bitmap fonts for UCS
  characters:

    ·  Public domain Unicode fonts:

	    -etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
	    -etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
	    -etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1

    ·  Composite fonts that the libfr_FGC font renderer creates by combining
       fonts available for other codesets

  These fonts currently cover only a subset of the characters in UCS.  Each
  of the ETL public domain fonts supports about 1000 characters, but does not
  include any characters for Chinese, Japanese, or Korean. The composite
  fonts created by the font renderer are generated only from fonts available
  for the ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9) codesets.

  Refer to iso8859-1(5) and iso8859-15(5) for the names of fonts available
  for Latin-1 and Latin-9 characters. Note that the Latin-9 fonts, which
  include glyphs for the euro character, provide the best support for the
  language_territory.UTF-8 locales, which also support this character.

  For information on printer support and converting bitmap font encoding to
  PostScript, see i18n_printing(5) and wwpsof(8).

SEE ALSO
  Commands: locale(1), wwpsof(8)

  Others: ascii(5), code_page(5), iso8859-1(5), iso8859-15(5), i18n_intro(5),
  i18n_printing(5), iconv_intro(5), l10n_intro(5)