summaryrefslogtreecommitdiff
path: root/doc/html
diff options
context:
space:
mode:
authorph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069>2020-02-24 16:35:15 +0000
committerph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069>2020-02-24 16:35:15 +0000
commitd902bb8b5bded98a5b7d48dfee158edd44b9ea5a (patch)
tree35f00c3b4bc346b3ed2499e252eb9c8c35c1685b /doc/html
parent9af350af12899021537ce50c25ba98bdd7c1e5ee (diff)
downloadpcre2-d902bb8b5bded98a5b7d48dfee158edd44b9ea5a.tar.gz
Documentation for PCRE2_UCP handling of upper/lower casing.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@1227 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/html')
-rw-r--r--doc/html/pcre2api.html56
-rw-r--r--doc/html/pcre2pattern.html5
-rw-r--r--doc/html/pcre2unicode.html20
3 files changed, 50 insertions, 31 deletions
diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index 9562a26..ee056ad 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -1481,13 +1481,13 @@ documentation.
</pre>
If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be
-changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
-properties are used for all characters with more than one other case, and for
-all characters whose code points are greater than U+007F. For lower valued
-characters with only one other case, a lookup table is used for speed. When
-PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
-and higher code points (available only in 16-bit or 32-bit mode) are treated as
-not having another case.
+changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
+PCRE2_UCP is set, Unicode properties are used for all characters with more than
+one other case, and for all characters whose code points are greater than
+U+007F. For lower valued characters with only one other case, a lookup table is
+used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
+used for all code points less than 256, and higher code points (available only
+in 16-bit or 32-bit mode) are treated as not having another case.
<pre>
PCRE2_DOLLAR_ENDONLY
</pre>
@@ -1820,16 +1820,23 @@ are not representable in UTF-16.
<pre>
PCRE2_UCP
</pre>
-This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
-\w, and some of the POSIX character classes. By default, only ASCII characters
-are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
-classify characters. More details are given in the section on
+This option has two effects. Firstly, it change the way PCRE2 processes \B,
+\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
+default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
+properties are used instead to classify characters. More details are given in
+the section on
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much
-longer. The option is available only if PCRE2 has been compiled with Unicode
-support (which is the default).
+longer.
+</P>
+<P>
+The second effect of PCRE2_UCP is to force the use of Unicode properties for
+upper/lower casing operations on characters with code points greater than 127,
+even when PCRE2_UTF is not set. This makes it possible, for example, to process
+strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has
+been compiled with Unicode support (which is the default).
<pre>
PCRE2_UNGREEDY
</pre>
@@ -1997,14 +2004,20 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code
point. However, this applies only to characters whose code points are less than
256. By default, higher-valued code points never match escapes such as \w or
-\d. When PCRE2 is built with Unicode support (the default), all characters can
-be tested with \p and \P, or, alternatively, the PCRE2_UCP option can be set
-when a pattern is compiled; this causes \w and friends to use Unicode property
-support instead of the built-in tables.
+\d.
+</P>
+<P>
+When PCRE2 is built with Unicode support (the default), the Unicode properties
+of all characters can be tested with \p and \P, or, alternatively, the
+PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
+friends to use Unicode property support instead of the built-in tables.
+PCRE2_UCP also causes upper/lower casing operations on characters with code
+points greater than 127 to use Unicode properties. These effects apply even
+when PCRE2_UTF is not set.
</P>
<P>
The use of locales with Unicode is discouraged. If you are handling characters
-with code points greater than 128, you should either use Unicode support, or
+with code points greater than 127, you should either use Unicode support, or
use locales, but not try to mix the two.
</P>
<P>
@@ -3494,7 +3507,10 @@ terminating a \Q quoted sequence) reverts to no case forcing. The sequences
\u and \l force the next character (if it is a letter) to upper or lower
case, respectively, and then the state automatically reverts to no case
forcing. Case forcing applies to all inserted characters, including those from
-capture groups and letters within \Q...\E quoted sequences.
+capture groups and letters within \Q...\E quoted sequences. If either
+PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
+properties are used for case forcing characters whose code points are greater
+than 127.
</P>
<P>
Note that case forcing sequences such as \U...\E do not nest. For example,
@@ -3915,7 +3931,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 16 February 2020
+Last updated: 24 February 2020
<br>
Copyright &copy; 1997-2020 University of Cambridge.
<br>
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index a6f1dfc..990f9f0 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -114,7 +114,8 @@ Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 256 via a lookup
-table.
+table. If also causes upper/lower casing operations to use Unicode properties
+for characters with code points greater than 127, even when UTF is not set.
</P>
<P>
Some applications that allow their users to supply patterns may wish to
@@ -3833,7 +3834,7 @@ Cambridge, England.
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 27 January 2020
+Last updated: 24 February 2020
<br>
Copyright &copy; 1997-2020 University of Cambridge.
<br>
diff --git a/doc/html/pcre2unicode.html b/doc/html/pcre2unicode.html
index 3d4e6b4..e2b9e78 100644
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@@ -19,7 +19,7 @@ UNICODE AND UTF SUPPORT
PCRE2 is normally built with Unicode support, though if you do not need it, you
can build it without, in which case the library will be smaller. With Unicode
support, PCRE2 has knowledge of Unicode character properties and can process
-text strings in UTF-8, UTF-16, or UTF-32 format (depending on the code unit
+strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
width), but this is not the default. Unless specifically requested, PCRE2
treats each code unit in a string as one character.
</P>
@@ -134,14 +134,16 @@ However, the special horizontal and vertical white space matching escapes (\h,
not PCRE2_UCP is set.
</P>
<br><b>
-CASE-EQUIVALENCE IN UTF MODE
+UNICODE CASE-EQUIVALENCE
</b><br>
<P>
-Case-insensitive matching in UTF mode makes use of Unicode properties except
-for characters whose code points are less than 128 and that have at most two
-case-equivalent values. For these, a direct table lookup is used for speed. A
-few Unicode characters such as Greek sigma have more than two code points that
-are case-equivalent, and these are treated specially.
+If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
+of Unicode properties except for characters whose code points are less than 128
+and that have at most two case-equivalent values. For these, a direct table
+lookup is used for speed. A few Unicode characters such as Greek sigma have
+more than two code points that are case-equivalent, and these are treated
+specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
+processing for non-UTF character encodings such as UCS-2.
<a name="scriptruns"></a></P>
<br><b>
SCRIPT RUNS
@@ -484,9 +486,9 @@ Cambridge, England.
REVISION
</b><br>
<P>
-Last updated: 24 May 2019
+Last updated: 23 February 2020
<br>
-Copyright &copy; 1997-2019 University of Cambridge.
+Copyright &copy; 1997-2020 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.