author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-04-26 10:44:13 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-04-26 10:44:13 +0000
commit f2088f309ecbe5ead0c14a77f2bfa6ed0d88992e
tree d0f55d002ef0b268bae1eb5cbbbbf7d41294df2d
parent 4d6103a376b6adb2b15ee14fb5b9a245dfbd05f6
Documentation updates.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1314 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- ChangeLog 4
-rw-r--r-- doc/pcre.3 11
-rw-r--r-- doc/pcreapi.3 53
-rw-r--r-- doc/pcrepattern.3 114
-rw-r--r-- doc/pcresyntax.3 6
-rw-r--r-- doc/pcretest.1 35
6 files changed, 159 insertions, 64 deletions
diff --git a/ChangeLog b/ChangeLog
index c277e03..0528b40 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -141,7 +141,9 @@ Version 8.33 xx-xxxx-201x
37. The value of the max lookbehind was not correctly preserved if a compiled
and saved regex was reloaded on a host of different endianness.
-38. Implemented (*LIMIT_MATCH) and (*LIMIT_RECURSION).
+38. Implemented (*LIMIT_MATCH) and (*LIMIT_RECURSION). As part of the extension
+ of the compiled pattern block, expand the flags field from 16 to 32 bits
+ because it was almost full.
Version 8.32 30-November-2012
diff --git a/doc/pcre.3 b/doc/pcre.3
index 84928f4..f48afeb 100644
--- a/doc/pcre.3
+++ b/doc/pcre.3
@@ -1,4 +1,4 @@
-.TH PCRE 3 "11 November 2012" "PCRE 8.32"
+.TH PCRE 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH INTRODUCTION
@@ -121,8 +121,11 @@ checked for UTF-8 validity. If the data string is very long, such a check might
use sufficiently many resources as to cause your application to lose
performance.
.P
-The best way of guarding against this possibility is to use the
+One way of guarding against this possibility is to use the
\fBpcre_fullinfo()\fP function to check the compiled pattern's options for UTF.
+Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF option at
+compile time. This causes a compile-time error if a pattern contains a
+UTF-setting sequence.
.P
If your application is one that supports UTF, be aware that validity checking
can take time. If the same data string is to be matched many times, you can use
@@ -197,6 +200,6 @@ two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
-Last updated: 11 November 2012
-Copyright (c) 1997-2012 University of Cambridge.
+Last updated: 26 April 2013
+Copyright (c) 1997-2013 University of Cambridge.
.fi
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 94912a5..6407144 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -1,4 +1,4 @@
-.TH PCREAPI 3 "05 April 2013" "PCRE 8.33"
+.TH PCREAPI 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.sp
@@ -761,7 +761,7 @@ This option locks out interpretation of the pattern as UTF-8 (or UTF-16 or
UTF-32 in the 16-bit and 32-bit libraries). In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the
pattern with (*UTF). This may be useful in applications that process patterns
-from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
+from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
causes an error.
.sp
PCRE_NEWLINE_CR
@@ -1092,13 +1092,13 @@ In 32-bit mode, the bitmap is used for 32-bit values less than 256.)
.P
These two optimizations apply to both \fBpcre_exec()\fP and
\fBpcre_dfa_exec()\fP, and the information is also used by the JIT compiler.
-The optimizations can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
+The optimizations can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
You might want to do this if your pattern contains callouts or (*MARK) and you
want to make use of these facilities in cases where matching fails.
.P
PCRE_NO_START_OPTIMIZE can be specified at either compile time or execution
-time. However, if PCRE_NO_START_OPTIMIZE is passed to \fBpcre_exec()\fP, (that
-is, after any JIT compilation has happened) JIT execution is disabled. For JIT
+time. However, if PCRE_NO_START_OPTIMIZE is passed to \fBpcre_exec()\fP, (that
+is, after any JIT compilation has happened) JIT execution is disabled. For JIT
execution to work with PCRE_NO_START_OPTIMIZE, the option must be set at
compile time.
.P
@@ -1193,6 +1193,7 @@ the following negative numbers:
PCRE_ERROR_BADENDIANNESS the pattern was compiled with different
endianness
PCRE_ERROR_BADOPTION the value of \fIwhat\fP was invalid
+ PCRE_ERROR_UNSET the requested field is not set
.sp
The "magic number" is placed at the start of each compiled pattern as a simple
check against passing an arbitrary memory pointer. The endianness error can
@@ -1311,6 +1312,13 @@ to return the full 32-bit range of the character, this value is deprecated;
instead the PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR values should
be used.
.sp
+ PCRE_INFO_MATCHLIMIT
+.sp
+If the pattern set a match limit by including an item of the form
+(*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth argument
+should point to an unsigned 32-bit integer. If no such value has been set, the
+call to \fBpcre_fullinfo()\fP returns the error PCRE_ERROR_UNSET.
+.sp
PCRE_INFO_MAXLOOKBEHIND
.sp
Return the number of characters (NB not bytes) in the longest lookbehind
@@ -1319,8 +1327,8 @@ matching using the partial matching facilities. Note that the simple assertions
\eb and \eB require a one-character lookbehind. \eA also registers a
one-character lookbehind, though it does not actually inspect the previous
character. This is to ensure that at least one character from the old segment
-is retained when a new segment is processed. Otherwise, if there are no
-lookbehinds in the pattern, \eA might match incorrectly at the start of a new
+is retained when a new segment is processed. Otherwise, if there are no
+lookbehinds in the pattern, \eA might match incorrectly at the start of a new
segment.
.sp
PCRE_INFO_MINLENGTH
@@ -1430,6 +1438,13 @@ alternatives begin with one of the following:
For such patterns, the PCRE_ANCHORED bit is set in the options returned by
\fBpcre_fullinfo()\fP.
.sp
+ PCRE_INFO_RECURSIONLIMIT
+.sp
+If the pattern set a recursion limit by including an item of the form
+(*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
+argument should point to an unsigned 32-bit integer. If no such value has been
+set, the call to \fBpcre_fullinfo()\fP returns the error PCRE_ERROR_UNSET.
+.sp
PCRE_INFO_SIZE
.sp
Return the size of the compiled pattern in bytes (for both libraries). The
@@ -1663,6 +1678,15 @@ block in which \fImatch_limit\fP is set, and PCRE_EXTRA_MATCH_LIMIT is set in
the \fIflags\fP field. If the limit is exceeded, \fBpcre_exec()\fP returns
PCRE_ERROR_MATCHLIMIT.
.P
+A value for the match limit may also be supplied by an item at the start of a
+pattern of the form
+.sp
+ (*LIMIT_MATCH=d)
+.sp
+where d is a decimal number. However, such a setting is ignored unless d is
+less than the limit set by the caller of \fBpcre_exec()\fP or, if no such limit
+is set, less than the default.
+.P
The \fImatch_limit_recursion\fP field is similar to \fImatch_limit\fP, but
instead of limiting the total number of times that \fBmatch()\fP is called, it
limits the depth of recursion. The recursion depth is a smaller number than the
@@ -1681,6 +1705,15 @@ with a \fBpcre_extra\fP block in which \fImatch_limit_recursion\fP is set, and
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the \fIflags\fP field. If the limit
is exceeded, \fBpcre_exec()\fP returns PCRE_ERROR_RECURSIONLIMIT.
.P
+A value for the recursion limit may also be supplied by an item at the start of
+a pattern of the form
+.sp
+ (*LIMIT_RECURSION=d)
+.sp
+where d is a decimal number. However, such a setting is ignored unless d is
+less than the limit set by the caller of \fBpcre_exec()\fP or, if no such limit
+is set, less than the default.
+.P
The \fIcallout_data\fP field is used in conjunction with the "callout" feature,
and is described in the
.\" HREF
@@ -2372,8 +2405,8 @@ never occur in a valid UTF-8 string.
PCRE_UTF8_ERR22
.sp
This error code was formerly used when the presence of a so-called
-"non-character" caused an error. Unicode corrigendum #9 makes it clear that
-such characters should not cause a string to be rejected, and so this code is
+"non-character" caused an error. Unicode corrigendum #9 makes it clear that
+such characters should not cause a string to be rejected, and so this code is
no longer in use and is never returned.
.
.
@@ -2843,6 +2876,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 05 April 2013
+Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
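Reviewer note on the pcreapi.3 hunks above: the rule they add (a limit supplied in the pattern is ignored unless it is lower than the caller's limit or, when the caller sets none, lower than the default) can be modelled in a few lines of C. This is only an illustrative sketch, not PCRE's source. The names `optional_limit`, `effective_limit` and `ASSUMED_DEFAULT_LIMIT` are hypothetical, and the default value is a placeholder assumption, since the real default is chosen when PCRE is built.

```c
#include <stdint.h>

/* Hypothetical placeholder: the real default is fixed when PCRE is built. */
#define ASSUMED_DEFAULT_LIMIT 10000000u

/* An optional limit: is_set is nonzero when value is meaningful. */
typedef struct {
    int is_set;
    uint32_t value;
} optional_limit;

/* Return the limit that would apply to a match under the documented rule:
   the pattern may lower the ceiling set by the caller (or the default),
   but can never raise it. */
uint32_t effective_limit(optional_limit caller, optional_limit pattern)
{
    uint32_t ceiling = caller.is_set ? caller.value : ASSUMED_DEFAULT_LIMIT;

    if (pattern.is_set && pattern.value < ceiling)
        return pattern.value;   /* pattern lowers the limit */
    return ceiling;             /* pattern setting is ignored */
}
```

With a caller limit of 5000, a pattern value of 1000 takes effect, whereas a pattern value of 9000 is ignored and 5000 still applies. The same model covers both the match limit and the recursion limit, because the two hunks state the rule in identical terms.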
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 9b124a2..50a1ab0 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "05 April 2013" "PCRE 8.33"
+.TH PCREPATTERN 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -20,6 +20,34 @@ have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
published by O'Reilly, covers regular expressions in great detail. This
description of PCRE's regular expressions is intended as reference material.
.P
+This document discusses the patterns that are supported by PCRE when one of its
+main matching functions, \fBpcre_exec()\fP (8-bit) or \fBpcre[16|32]_exec()\fP
+(16- or 32-bit), is used. PCRE also has alternative matching functions,
+\fBpcre_dfa_exec()\fP and \fBpcre[16|32]_dfa_exec()\fP, which match using a
+different algorithm that is not Perl-compatible. Some of the features discussed
+below are not available when DFA matching is used. The advantages and
+disadvantages of the alternative functions, and how they differ from the normal
+functions, are discussed in the
+.\" HREF
+\fBpcrematching\fP
+.\"
+page.
+.
+.
+.SH "SPECIAL START-OF-PATTERN ITEMS"
+.rs
+.sp
+A number of options that can be passed to \fBpcre_compile()\fP can also be set
+by special items at the start of a pattern. These are not Perl-compatible, but
+are provided to make these options accessible to pattern writers who are not
+able to change the program that processes the pattern. Any number of these
+items may appear, but they must all be together right at the start of the
+pattern string, and the letters must be in upper case.
+.
+.
+.SS "UTF support"
+.rs
+.sp
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 strings in the original library, an
extra library that supports 16-bit and UTF-16 character strings, and a
@@ -36,16 +64,23 @@ these special sequences:
.sp
(*UTF) is a generic sequence that can be used with any of the libraries.
Starting a pattern with such a sequence is equivalent to setting the relevant
-option. This feature is not Perl-compatible. How setting a UTF mode affects
-pattern matching is mentioned in several places below. There is also a summary
-of features in the
+option. How setting a UTF mode affects pattern matching is mentioned in several
+places below. There is also a summary of features in the
.\" HREF
\fBpcreunicode\fP
.\"
page.
.P
-Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
+Some applications that allow their users to supply patterns may wish to
+restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF
+option is set at compile time, (*UTF) etc. are not allowed, and their
+appearance causes an error.
+.
+.
+.SS "Unicode property support"
+.rs
+.sp
+Another special sequence that may appear at the start of a pattern is
.sp
(*UCP)
.sp
@@ -53,38 +88,17 @@ This has the same effect as setting the PCRE_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 128 via a lookup
table.
-.P
-If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
-PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are
-also some more of these special sequences that are concerned with the handling
-of newlines; they are described below.
-.P
-The remainder of this document discusses the patterns that are supported by
-PCRE when one its main matching functions, \fBpcre_exec()\fP (8-bit) or
-\fBpcre[16|32]_exec()\fP (16- or 32-bit), is used. PCRE also has alternative
-matching functions, \fBpcre_dfa_exec()\fP and \fBpcre[16|32_dfa_exec()\fP,
-which match using a different algorithm that is not Perl-compatible. Some of
-the features discussed below are not available when DFA matching is used. The
-advantages and disadvantages of the alternative functions, and how they differ
-from the normal functions, are discussed in the
-.\" HREF
-\fBpcrematching\fP
-.\"
-page.
.
.
-.SH "EBCDIC CHARACTER CODES"
+.SS "Disabling start-up optimizations"
.rs
.sp
-PCRE can be compiled to run in an environment that uses EBCDIC as its character
-code rather than ASCII or Unicode (typically a mainframe system). In the
-sections below, character code values are ASCII or Unicode; in an EBCDIC
-environment these characters may have different code values, and there are no
-code points greater than 255.
+If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
+PCRE_NO_START_OPTIMIZE option either at compile or matching time.
.
.
.\" HTML <a name="newlines"></a>
-.SH "NEWLINE CONVENTIONS"
+.SS "Newline conventions"
.rs
.sp
PCRE supports five different conventions for indicating line breaks in
@@ -117,9 +131,7 @@ example, on a Unix system where LF is the default newline sequence, the pattern
(*CR)a.b
.sp
changes the convention to CR. That pattern matches "a\enb" because LF is no
-longer a newline. Note that these special settings, which are not
-Perl-compatible, are recognized only at the very start of a pattern, and that
-they must be in upper case. If more than one of them is present, the last one
+longer a newline. If more than one of these settings is present, the last one
is used.
.P
The newline convention affects where the circumflex and dollar assertions are
@@ -136,6 +148,38 @@ below. A change of \eR setting can be combined with a change of newline
convention.
.
.
+.SS "Setting match and recursion limits"
+.rs
+.sp
+The caller of \fBpcre_exec()\fP can set a limit on the number of times the
+internal \fBmatch()\fP function is called and on the maximum depth of
+recursive calls. These facilities are provided to catch runaway matches that
+are provoked by patterns with huge matching trees (a typical example is a
+pattern with nested unlimited repeats) and to avoid running out of system stack
+by too much recursion. When one of these limits is reached, \fBpcre_exec()\fP
+gives an error return. The limits can also be set by items at the start of the
+pattern of the form
+.sp
+ (*LIMIT_MATCH=d)
+ (*LIMIT_RECURSION=d)
+.sp
+where d is any number of decimal digits. However, the value of the setting must
+be less than the value set by the caller of \fBpcre_exec()\fP for it to have
+any effect. In other words, the pattern writer can lower the limit set by the
+programmer, but not raise it. If there is more than one setting of one of these
+limits, the lower value is used.
+.
+.
+.SH "EBCDIC CHARACTER CODES"
+.rs
+.sp
+PCRE can be compiled to run in an environment that uses EBCDIC as its character
+code rather than ASCII or Unicode (typically a mainframe system). In the
+sections below, character code values are ASCII or Unicode; in an EBCDIC
+environment these characters may have different code values, and there are no
+code points greater than 255.
+.
+.
.SH "CHARACTERS AND METACHARACTERS"
.rs
.sp
@@ -3101,6 +3145,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 05 April 2013
+Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
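Reviewer note on the pcrepattern.3 hunks above: the new "Setting match and recursion limits" text says that the items must sit at the very start of the pattern, must be in upper case, and that when a limit is set more than once the lower value is used. The sketch below shows one way to read that rule for consecutive (*LIMIT_MATCH=d) items. It is a simplified illustration under stated assumptions, not PCRE's parser: the function name `lowest_leading_match_limit` is hypothetical, other start-of-pattern items such as (*UTF) are not handled, and clamping oversized numbers to 32 bits is an assumption of this sketch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Scan consecutive (*LIMIT_MATCH=d) items at the start of pattern.
   Returns 1 and stores the lowest value in *out if at least one item
   is found; returns 0 and leaves *out untouched otherwise. */
int lowest_leading_match_limit(const char *pattern, uint32_t *out)
{
    static const char prefix[] = "(*LIMIT_MATCH=";
    const size_t prefix_len = sizeof prefix - 1;
    int found = 0;

    /* strncmp is case-sensitive, so lower-case spellings are not accepted. */
    while (strncmp(pattern, prefix, prefix_len) == 0) {
        const char *digits = pattern + prefix_len;
        char *end;
        unsigned long value;

        if (*digits < '0' || *digits > '9')
            break;                      /* d must start with a decimal digit */
        value = strtoul(digits, &end, 10);
        if (*end != ')')
            break;                      /* malformed item: stop scanning */
        if (value > UINT32_MAX)
            value = UINT32_MAX;         /* assumption: clamp to 32 bits */

        if (!found || value < *out)
            *out = (uint32_t)value;     /* keep the lower of repeated settings */
        found = 1;
        pattern = end + 1;              /* continue with the next item */
    }
    return found;
}
```

For the pattern "(*LIMIT_MATCH=500)(*LIMIT_MATCH=200)abc" the function reports 200. An item that appears after ordinary pattern text is not seen at all, which mirrors the requirement that these settings come first.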
diff --git a/doc/pcresyntax.3 b/doc/pcresyntax.3
index fb229d3..c7b92cf 100644
--- a/doc/pcresyntax.3
+++ b/doc/pcresyntax.3
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "27 February 2013" "PCRE 8.33"
+.TH PCRESYNTAX 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -347,6 +347,8 @@ but some of them use Unicode properties if PCRE_UCP is set. You can use
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
+ (*LIMIT_MATCH=d) set the match limit to d (decimal number)
+ (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
@@ -493,6 +495,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 27 February 2013
+Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index cae1522..2fb121d 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -1,4 +1,4 @@
-.TH PCRETEST 1 "05 April 2013" "PCRE 8.33"
+.TH PCRETEST 1 "26 April 2013" "PCRE 8.33"
.SH NAME
pcretest - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -40,23 +40,34 @@ PCRE, and are unlikely to be of use otherwise. They are all documented here,
but without much justification.
.
.
+.SH "INPUT DATA FORMAT"
+.rs
+.sp
+Input to \fBpcretest\fP is processed line by line, either by calling the C
+library's \fBfgets()\fP function, or via the \fBlibreadline\fP library (see
+below). In Unix-like environments, \fBfgets()\fP treats any bytes other than
+newline as data characters. However, in some Windows environments character 26
+(hex 1A) causes an immediate end of file, and no further data is read. For
+maximum portability, therefore, it is safest to use only ASCII characters in
+\fBpcretest\fP input files.
+.
+.
.SH "PCRE's 8-BIT, 16-BIT AND 32-BIT LIBRARIES"
.rs
.sp
From release 8.30, two separate PCRE libraries can be built. The original one
supports 8-bit character strings, whereas the newer 16-bit library supports
-character strings encoded in 16-bit units. From release 8.32, a third
-library can be built, supporting character strings encoded in 32-bit units.
-The \fBpcretest\fP program can be
-used to test all three libraries. However, it is itself still an 8-bit program,
-reading 8-bit input and writing 8-bit output. When testing the 16-bit or 32-bit
-library, the patterns and data strings are converted to 16- or 32-bit format
-before being passed to the PCRE library functions. Results are converted to
-8-bit for output.
+character strings encoded in 16-bit units. From release 8.32, a third library
+can be built, supporting character strings encoded in 32-bit units. The
+\fBpcretest\fP program can be used to test all three libraries. However, it is
+itself still an 8-bit program, reading 8-bit input and writing 8-bit output.
+When testing the 16-bit or 32-bit library, the patterns and data strings are
+converted to 16- or 32-bit format before being passed to the PCRE library
+functions. Results are converted to 8-bit for output.
.P
References to functions and structures of the form \fBpcre[16|32]_xx\fP below
-mean "\fBpcre_xx\fP when using the 8-bit library or \fBpcre16_xx\fP when using
-the 16-bit library".
+mean "\fBpcre_xx\fP when using the 8-bit library, \fBpcre16_xx\fP when using
+the 16-bit library, or \fBpcre32_xx\fP when using the 32-bit library".
.
.
.SH "COMMAND LINE OPTIONS"
@@ -1083,6 +1094,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 05 April 2013
+Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi