[Patch] doc patch on Unicode

Message-Id: <20020519005515.18F0.BQW10602@nifty.com>
p4raw-id: //depot/perl@16676
author: SADAHIRO Tomoyuki <BQW10602@nifty.com> 2002-05-19 10:01:58 +0900
committer: Jarkko Hietaniemi <jhi@iki.fi> 2002-05-18 15:40:35 +0000
commit: ec90690f0842364bbbf48a984e7382b2d660d09d (patch)
tree: 5f95a6c7952bff14b75a9b38a6d3e9bc60967e0c /pod/perlunicode.pod
parent: 55b010e5dabcd8fc816875f76a8e02460766d4e5 (diff)
download: perl-ec90690f0842364bbbf48a984e7382b2d660d09d.tar.gz
1 files changed, 14 insertions, 14 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d2c48e26b5..38cd9c7b20 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -162,7 +162,7 @@ Named Unicode properties, scripts, and block ranges may be used like
 character classes via the new C<\p{}> (matches property) and C<\P{}>
 (doesn't match property) constructs. For instance, C<\p{Lu}> matches any
 character with the Unicode "Lu" (Letter, uppercase) property, while
-C<\p{M}> matches any character with a "M" (mark -- accents and such)
+C<\p{M}> matches any character with an "M" (mark -- accents and such)
 property. Single letter properties may omit the brackets, so that can be
 written C<\pM> also. Many predefined properties are available, such
 as C<\p{Mirrored}> and C<\p{Tibetan}>.
@@ -814,11 +814,11 @@ The following table is from Unicode 3.2.
 
    U+0000..U+007F       00..7F
    U+0080..U+07FF       C2..DF    80..BF
-   U+0800..U+0FFF       E0        A0..BF    80..BF��
-   U+1000..U+CFFF       E1..EC    80..BF    80..BF��
-   U+D000..U+D7FF       ED        80..9F    80..BF��
+   U+0800..U+0FFF       E0        A0..BF    80..BF
+   U+1000..U+CFFF       E1..EC    80..BF    80..BF
+   U+D000..U+D7FF       ED        80..9F    80..BF
    U+D800..U+DFFF       ******* ill-formed *******
-   U+E000..U+FFFF       EE..EF    80..BF    80..BF��
+   U+E000..U+FFFF       EE..EF    80..BF    80..BF
   U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
   U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
  U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
@@ -857,15 +857,15 @@ UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks)
 use them internally.)
 
 UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
-0x0000..0xFFFF are stored in two 16-bit units, and the code points
-0x010000..0x10FFFF in two 16-bit units.  The latter case is
+U+0000..U+FFFF are stored in a single 16-bit unit, and the code points
+U+10000..U+10FFFF in two 16-bit units.  The latter case is
 using I<surrogates>, the first 16-bit unit being the I<high
 surrogate>, and the second being the I<low surrogate>.
 
-Surrogates are code points set aside to encode the 0x01000..0x10FFFF
+Surrogates are code points set aside to encode the U+10000..U+10FFFF
 range of Unicode code points in pairs of 16-bit units.  The I<high
-surrogates> are the range 0xD800..0xDBFF, and the I<low surrogates>
-are the range 0xDC00..0xDFFFF.  The surrogate encoding is
+surrogates> are the range U+D800..U+DBFF, and the I<low surrogates>
+are the range U+DC00..U+DFFF.  The surrogate encoding is
 
 	$hi = ($uni - 0x10000) / 0x400 + 0xD800;
 	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
@@ -888,7 +888,7 @@ This introduces another problem: what if you just know that your data
 is UTF-16, but you don't know which endianness?  Byte Order Marks
 (BOMs) are a solution to this.  A special character has been reserved
 in Unicode to function as a byte order marker: the character with the
-code point 0xFEFF is the BOM.
+code point U+FEFF is the BOM.
 
 The trick is that if you read a BOM, you will know the byte order,
 since if it was written on a big endian platform, you will read the
@@ -897,9 +897,9 @@ you will read the bytes 0xFF 0xFE.  (And if the originating platform
 was writing in UTF-8, you will read the bytes 0xEF 0xBB 0xBF.)
 
 The way this trick works is that the character with the code point
-0xFFFE is guaranteed not to be a valid Unicode character, so the
+U+FFFE is guaranteed not to be a valid Unicode character, so the
 sequence of bytes 0xFF 0xFE is unambiguously "BOM, represented in
-little-endian format" and cannot be "0xFFFE, represented in big-endian
+little-endian format" and cannot be "U+FFFE, represented in big-endian
 format".
 
 =item *
@@ -916,7 +916,7 @@ needed.  The BOM signatures will be 0x00 0x00 0xFE 0xFF for BE and
 UCS-2, UCS-4
 
 Encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
-encoding.  Unlike UTF-16, UCS-2 is not extensible beyond 0xFFFF,
+encoding.  Unlike UTF-16, UCS-2 is not extensible beyond U+FFFF,
 because it does not use surrogates.  UCS-4 is a 32-bit encoding,
 functionally identical to UTF-32.
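
Note on the surrogate formula shown as context in the UTF-16 hunk: the arithmetic can be checked with a small round-trip script. This is only a sketch; the function names to_surrogates and from_surrogates are hypothetical and are not part of the patch or of perlunicode.pod. The int() call is an addition made here, because Perl's "/" operator yields a floating-point result unless "use integer" is in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point in U+10000..U+10FFFF
# into a (high surrogate, low surrogate) pair.
sub to_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is outside U+10000..U+10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $v  = $uni - 0x10000;
    my $hi = int($v / 0x400) + 0xD800;   # int(): "/" is not integer division
    my $lo = $v % 0x400 + 0xDC00;
    return ($hi, $lo);
}

# Hypothetical helper: recombine a surrogate pair into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    die "not a high surrogate\n" unless $hi >= 0xD800 && $hi <= 0xDBFF;
    die "not a low surrogate\n"  unless $lo >= 0xDC00 && $lo <= 0xDFFF;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1D11E);
printf "U+1D11E -> %04X %04X\n", $hi, $lo;               # D834 DD1E
printf "back    -> U+%04X\n", from_surrogates($hi, $lo); # U+1D11E
```

The range checks mirror the corrected ranges in the patch: high surrogates U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.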
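
Note on the BOM paragraphs touched by this patch: the byte-order check they describe might look roughly like the following. It is a sketch under stated assumptions, not code from the patch. The name sniff_bom is hypothetical, the input is assumed to be a string of raw octets, and only the UTF-16 and UTF-8 signatures are handled.

```perl
use strict;
use warnings;

# Hypothetical helper: inspect the leading octets of $bytes and return
# (encoding name, BOM length in bytes), or an empty list if no BOM is found.
sub sniff_bom {
    my ($bytes) = @_;
    return ('UTF-8',    3) if $bytes =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16BE', 2) if $bytes =~ /\A\xFE\xFF/;   # U+FEFF, big-endian
    return ('UTF-16LE', 2) if $bytes =~ /\A\xFF\xFE/;   # U+FEFF, little-endian
    return;
}

my ($enc, $len) = sniff_bom("\xFF\xFEa\x00");
print "$enc, BOM is $len bytes\n";   # UTF-16LE, BOM is 2 bytes
```

UTF-32 is deliberately left out. A UTF-32LE stream also starts with the bytes 0xFF 0xFE, so this sketch would misreport it as UTF-16LE; a fuller version would have to test the four-byte signatures first.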