author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-05-12 16:33:19 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-05-12 16:33:19 +0000
commit 2a0d7e55879dfe10247c0446a57b55a163e69d7f
tree 78503e8f325f24fbcc627c5085457e132ba3d42e
parent 3dba7ab3ed2ffad9049664eb4a99273597e6b24c
Updated html docs.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1329 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- doc/html/pcre16.html 9
-rw-r--r-- doc/html/pcre32.html 9
-rw-r--r-- doc/html/pcre_dfa_exec.html 9
-rw-r--r-- doc/html/pcre_exec.html 9
-rw-r--r-- doc/html/pcreapi.html 99
-rw-r--r-- doc/pcre.txt 1163
6 files changed, 664 insertions, 634 deletions
diff --git a/doc/html/pcre16.html b/doc/html/pcre16.html
index 179b2ad..3ade219 100644
--- a/doc/html/pcre16.html
+++ b/doc/html/pcre16.html
@@ -259,8 +259,9 @@ buffer, including the zero terminator if the string was zero-terminated.
</P>
<br><a name="SEC12" href="#TOC1">SUBJECT STRING OFFSETS</a><br>
<P>
-The offsets within subject strings that are returned by the matching functions
-are in 16-bit units rather than bytes.
+The lengths and starting offsets of subject strings must be specified in 16-bit
+data units, and the offsets within subject strings that are returned by the
+matching functions are also in 16-bit units rather than bytes.
</P>
<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br>
<P>
@@ -374,9 +375,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 08 November 2012
+Last updated: 12 May 2013
<br>
-Copyright &copy; 1997-2012 University of Cambridge.
+Copyright &copy; 1997-2013 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
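
As an illustration of the unit rule stated in the SUBJECT STRING OFFSETS hunk above, the following minimal C sketch counts the length of a zero-terminated 16-bit string in data units rather than bytes. The helper names `units16_length` and `units16_byte_length` are hypothetical and not part of PCRE, and the sketch assumes subjects are held in plain `uint16_t` arrays (PCRE itself declares its own `PCRE_UCHAR16` type).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE function): length of a zero-terminated
   16-bit string, counted in 16-bit data units, excluding the terminator. */
static size_t units16_length(const uint16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* The same length expressed in bytes, shown only for comparison. */
static size_t units16_byte_length(const uint16_t *s)
{
    return units16_length(s) * sizeof(uint16_t);
}
```

For a three-unit string the value to pass as a length to the 16-bit matching functions would be 3, not 6. A character encoded as a UTF-16 surrogate pair occupies two data units, so it contributes 2 to the length.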
diff --git a/doc/html/pcre32.html b/doc/html/pcre32.html
index 629cf7c..2155ee8 100644
--- a/doc/html/pcre32.html
+++ b/doc/html/pcre32.html
@@ -259,8 +259,9 @@ buffer, including the zero terminator if the string was zero-terminated.
</P>
<br><a name="SEC12" href="#TOC1">SUBJECT STRING OFFSETS</a><br>
<P>
-The offsets within subject strings that are returned by the matching functions
-are in 32-bit units rather than bytes.
+The lengths and starting offsets of subject strings must be specified in 32-bit
+data units, and the offsets within subject strings that are returned by the
+matching functions are also in 32-bit units rather than bytes.
</P>
<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br>
<P>
@@ -373,9 +374,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 08 November 2012
+Last updated: 12 May 2013
<br>
-Copyright &copy; 1997-2012 University of Cambridge.
+Copyright &copy; 1997-2013 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/doc/html/pcre_dfa_exec.html b/doc/html/pcre_dfa_exec.html
index 663e1d0..e91b670 100644
--- a/doc/html/pcre_dfa_exec.html
+++ b/doc/html/pcre_dfa_exec.html
@@ -50,16 +50,17 @@ are:
<i>extra</i> Points to an associated <b>pcre[16|32]_extra</b> structure,
or is NULL
<i>subject</i> Points to the subject string
- <i>length</i> Length of the subject string, in bytes
- <i>startoffset</i> Offset in bytes in the subject at which to
- start matching
+ <i>length</i> Length of the subject string
+ <i>startoffset</i> Offset in the subject at which to start matching
<i>options</i> Option bits
<i>ovector</i> Points to a vector of ints for result offsets
<i>ovecsize</i> Number of elements in the vector
<i>workspace</i> Points to a vector of ints used as working space
<i>wscount</i> Number of elements in the vector
</pre>
-The options are:
+The units for <i>length</i> and <i>startoffset</i> are bytes for
+<b>pcre_dfa_exec()</b>, 16-bit data items for <b>pcre16_dfa_exec()</b>, and
+32-bit items for <b>pcre32_dfa_exec()</b>. The options are:
<pre>
PCRE_ANCHORED Match only at the first position
PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
diff --git a/doc/html/pcre_exec.html b/doc/html/pcre_exec.html
index e4ddf9a..0cc3bb7 100644
--- a/doc/html/pcre_exec.html
+++ b/doc/html/pcre_exec.html
@@ -45,14 +45,15 @@ offsets to captured substrings. Its arguments are:
<i>extra</i> Points to an associated <b>pcre[16|32]_extra</b> structure,
or is NULL
<i>subject</i> Points to the subject string
- <i>length</i> Length of the subject string, in bytes
- <i>startoffset</i> Offset in bytes in the subject at which to
- start matching
+ <i>length</i> Length of the subject string
+ <i>startoffset</i> Offset in the subject at which to start matching
<i>options</i> Option bits
<i>ovector</i> Points to a vector of ints for result offsets
<i>ovecsize</i> Number of elements in the vector (a multiple of 3)
</pre>
-The options are:
+The units for <i>length</i> and <i>startoffset</i> are bytes for
+<b>pcre_exec()</b>, 16-bit data items for <b>pcre16_exec()</b>, and 32-bit items
+for <b>pcre32_exec()</b>. The options are:
<pre>
PCRE_ANCHORED Match only at the first position
PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index 00c0eb7..34fa096 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -187,10 +187,10 @@ by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the
</P>
<P>
References to bytes and UTF-8 in this document should be read as references to
-16-bit data quantities and UTF-16 when using the 16-bit library, or 32-bit data
-quantities and UTF-32 when using the 32-bit library, unless specified
-otherwise. More details of the specific differences for the 16-bit and 32-bit
-libraries are given in the
+16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data
+units and UTF-32 when using the 32-bit library, unless specified otherwise.
+More details of the specific differences for the 16-bit and 32-bit libraries
+are given in the
<a href="pcre16.html"><b>pcre16</b></a>
and
<a href="pcre32.html"><b>pcre32</b></a>
@@ -558,16 +558,16 @@ Otherwise, if compilation of a pattern fails, <b>pcre_compile()</b> returns
NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual
error message. This is a static string that is part of the library. You must
not try to free it. Normally, the offset from the start of the pattern to the
-byte that was being processed when the error was discovered is placed in the
-variable pointed to by <i>erroffset</i>, which must not be NULL (if it is, an
-immediate error is given). However, for an invalid UTF-8 string, the offset is
-that of the first byte of the failing character.
+data unit that was being processed when the error was discovered is placed in
+the variable pointed to by <i>erroffset</i>, which must not be NULL (if it is,
+an immediate error is given). However, for an invalid UTF-8 or UTF-16 string,
+the offset is that of the first data unit of the failing character.
</P>
<P>
Some errors are not detected until the whole pattern has been scanned; in these
cases, the offset passed back is the length of the pattern. Note that the
-offset is in bytes, not characters, even in UTF-8 mode. It may sometimes point
-into the middle of a UTF-8 character.
+offset is in data units, not characters, even in a UTF mode. It may sometimes
+point into the middle of a UTF-8 or UTF-16 character.
</P>
<P>
If <b>pcre_compile2()</b> is used instead of <b>pcre_compile()</b>, and the
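
Because the error offset described in the hunk above is counted in data units and not characters, a caller that wants to report a character position for a UTF-8 pattern has to convert it. The following is a minimal sketch of one way to do that; `utf8_chars_before` is a hypothetical helper, not a PCRE function, and it assumes the offset does not exceed the pattern length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): count the UTF-8 characters that
   start before the given byte offset. Continuation bytes have the bit
   pattern 10xxxxxx, so every other byte begins a character. */
static size_t utf8_chars_before(const char *s, size_t byte_offset)
{
    size_t i, count = 0;
    assert(s != NULL);
    for (i = 0; i < byte_offset; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

If the offset points into the middle of a multi-byte character, the character containing that byte is included in the count, because its first byte lies before the offset.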
@@ -741,12 +741,14 @@ binary zero character followed by z).
<pre>
PCRE_MULTILINE
</pre>
-By default, PCRE treats the subject string as consisting of a single line of
-characters (even if it actually contains newlines). The "start of line"
-metacharacter (^) matches only at the start of the string, while the "end of
-line" metacharacter ($) matches only at the end of the string, or before a
-terminating newline (unless PCRE_DOLLAR_ENDONLY is set). This is the same as
-Perl.
+By default, for the purposes of matching "start of line" and "end of line",
+PCRE treats the subject string as consisting of a single line of characters,
+even if it actually contains newlines. The "start of line" metacharacter (^)
+matches only at the start of the string, and the "end of line" metacharacter
+($) matches only at the end of the string, or before a terminating newline
+(except when PCRE_DOLLAR_ENDONLY is set). Note, however, that unless
+PCRE_DOTALL is set, the "any character" metacharacter (.) does not match at a
+newline. This behaviour (for ^, $, and dot) is the same as Perl.
</P>
<P>
When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
@@ -1314,7 +1316,7 @@ call to <b>pcre_fullinfo()</b> returns the error PCRE_ERROR_UNSET.
<pre>
PCRE_INFO_MAXLOOKBEHIND
</pre>
-Return the number of characters (NB not bytes) in the longest lookbehind
+Return the number of characters (NB not data units) in the longest lookbehind
assertion in the pattern. This information is useful when doing multi-segment
matching using the partial matching facilities. Note that the simple assertions
\b and \B require a one-character lookbehind. \A also registers a
@@ -1328,11 +1330,11 @@ segment.
</pre>
If the pattern was studied and a minimum length for matching subject strings
was computed, its value is returned. Otherwise the returned value is -1. The
-value is a number of characters, which in UTF-8 mode may be different from the
-number of bytes. The fourth argument should point to an <b>int</b> variable. A
-non-negative value is a lower bound to the length of any matching string. There
-may not be any strings of that length that do actually match, but every string
-that does match is at least that long.
+value is a number of characters, which in UTF mode may be different from the
+number of data units. The fourth argument should point to an <b>int</b>
+variable. A non-negative value is a lower bound to the length of any matching
+string. There may not be any strings of that length that do actually match, but
+every string that does match is at least that long.
<pre>
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
@@ -1356,10 +1358,10 @@ length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first
entry of the table. This is a pointer to <b>char</b> in the 8-bit library, where
the first two bytes of each entry are the number of the capturing parenthesis,
most significant byte first. In the 16-bit library, the pointer points to
-16-bit data units, the first of which contains the parenthesis number.
-In the 32-bit library, the pointer points to 32-bit data units, the first of
-which contains the parenthesis number. The rest
-of the entry is the corresponding name, zero terminated.
+16-bit data units, the first of which contains the parenthesis number. In the
+32-bit library, the pointer points to 32-bit data units, the first of which
+contains the parenthesis number. The rest of the entry is the corresponding
+name, zero terminated.
</P>
<P>
The names are in alphabetical order. Duplicate names may appear if (?| is used
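
The 8-bit name table layout described in the hunk above (two bytes holding the parenthesis number, most significant byte first, followed by a zero-terminated name) can be decoded as in this minimal sketch. The helper names are hypothetical and the code covers only the 8-bit layout; in the 16-bit and 32-bit libraries the number occupies a single data unit instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: read the capturing parenthesis number from the first
   two bytes of an 8-bit name-table entry, most significant byte first. */
static int name_entry_group(const unsigned char *entry)
{
    assert(entry != NULL);
    return (entry[0] << 8) | entry[1];
}

/* Hypothetical helper: the zero-terminated name follows the two number bytes. */
static const char *name_entry_name(const unsigned char *entry)
{
    assert(entry != NULL);
    return (const char *)(entry + 2);
}
```

Assuming the table pointer and entry size have been obtained through PCRE_INFO_NAMETABLE and PCRE_INFO_NAMEENTRYSIZE, entry number i would start at `table + i * entry_size`.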
@@ -1433,7 +1435,7 @@ set, the call to <b>pcre_fullinfo()</b> returns the error PCRE_ERROR_UNSET.
<pre>
PCRE_INFO_SIZE
</pre>
-Return the size of the compiled pattern in bytes (for both libraries). The
+Return the size of the compiled pattern in bytes (for all three libraries). The
fourth argument should point to a <b>size_t</b> variable. This value does not
include the size of the <b>pcre</b> structure that is returned by
<b>pcre_compile()</b>. The value that is passed as the argument to
@@ -1444,11 +1446,12 @@ does not alter the value returned by this option.
<pre>
PCRE_INFO_STUDYSIZE
</pre>
-Return the size in bytes of the data block pointed to by the <i>study_data</i>
-field in a <b>pcre_extra</b> block. If <b>pcre_extra</b> is NULL, or there is no
-study data, zero is returned. The fourth argument should point to a
-<b>size_t</b> variable. The <i>study_data</i> field is set by <b>pcre_study()</b>
-to record information that will speed up matching (see the section entitled
+Return the size in bytes (for all three libraries) of the data block pointed to
+by the <i>study_data</i> field in a <b>pcre_extra</b> block. If <b>pcre_extra</b>
+is NULL, or there is no study data, zero is returned. The fourth argument
+should point to a <b>size_t</b> variable. The <i>study_data</i> field is set by
+<b>pcre_study()</b> to record information that will speed up matching (see the
+section entitled
<a href="#studyingapattern">"Studying a pattern"</a>
above). The format of the <i>study_data</i> block is private, but its length
is made available via this option so that it can be saved and restored (see the
@@ -1982,13 +1985,19 @@ The string to be matched by <b>pcre_exec()</b>
</b><br>
<P>
The subject string is passed to <b>pcre_exec()</b> as a pointer in
-<i>subject</i>, a length in bytes in <i>length</i>, and a starting byte offset
-in <i>startoffset</i>. If this is negative or greater than the length of the
-subject, <b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET. When the starting
-offset is zero, the search for a match starts at the beginning of the subject,
-and this is by far the most common case. In UTF-8 mode, the byte offset must
-point to the start of a UTF-8 character (or the end of the subject). Unlike the
-pattern string, the subject may contain binary zero bytes.
+<i>subject</i>, a length in <i>length</i>, and a starting offset in
+<i>startoffset</i>. The units for <i>length</i> and <i>startoffset</i> are bytes
+for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit
+data items for the 32-bit library.
+</P>
+<P>
+If <i>startoffset</i> is negative or greater than the length of the subject,
+<b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET. When the starting offset is
+zero, the search for a match starts at the beginning of the subject, and this
+is by far the most common case. In UTF-8 or UTF-16 mode, the offset must point
+to the start of a character, or the end of the subject (in UTF-32 mode, one
+data unit equals one character, so all offsets are valid). Unlike the pattern
+string, the subject may contain binary zeroes.
</P>
<P>
A non-zero starting offset is useful when searching for another match in the
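
The starting-offset rules in the hunk above can be checked by the caller before matching. The sketch below does this for the 8-bit library in UTF-8 mode only; `utf8_start_offset_ok` is a hypothetical pre-check, not a PCRE function, and it assumes the subject is otherwise valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pre-check (not a PCRE function). Returns 1 if startoffset is
   acceptable for a UTF-8 subject of the given byte length, 0 otherwise. */
static int utf8_start_offset_ok(const char *subject, int length, int startoffset)
{
    assert(subject != NULL && length >= 0);
    if (startoffset < 0 || startoffset > length)
        return 0;   /* the case reported as PCRE_ERROR_BADOFFSET */
    if (startoffset == length)
        return 1;   /* the end of the subject is allowed */
    /* A character starts at any byte that is not a 10xxxxxx continuation byte. */
    return ((unsigned char)subject[startoffset] & 0xC0) != 0x80;
}
```

In UTF-32 mode no such boundary test is needed, since one data unit equals one character; a UTF-16 version would instead reject offsets that point at a low surrogate.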
@@ -2056,10 +2065,12 @@ rounded down.
When a match is successful, information about captured substrings is returned
in pairs of integers, starting at the beginning of <i>ovector</i>, and
continuing up to two-thirds of its length at the most. The first element of
-each pair is set to the byte offset of the first character in a substring, and
-the second is set to the byte offset of the first character after the end of a
-substring. <b>Note</b>: these values are always byte offsets, even in UTF-8
-mode. They are not character counts.
+each pair is set to the offset of the first character in a substring, and the
+second is set to the offset of the first character after the end of a
+substring. These values are always data unit offsets, even in UTF mode. They
+are byte offsets in the 8-bit library, 16-bit data item offsets in the 16-bit
+library, and 32-bit data item offsets in the 32-bit library. <b>Note</b>: they
+are not character counts.
</P>
<P>
The first pair of integers, <i>ovector[0]</i> and <i>ovector[1]</i>, identify the
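
The offset pairs described in the hunk above can be turned into substrings with simple pointer arithmetic. The following minimal sketch does so for an 8-bit subject; `copy_capture` is a hypothetical helper written for illustration (PCRE ships its own convenience functions such as pcre_copy_substring()), and it assumes that an unset group is marked by negative offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy captured substring n out of an 8-bit subject,
   using offset pairs laid out as ovector[2n] (start) and ovector[2n+1]
   (first data unit after the end). Returns the length in bytes, or -1 if
   the group is unset or the buffer is too small. */
static int copy_capture(const char *subject, const int *ovector, int n,
                        char *buf, size_t bufsize)
{
    int start, end, len;
    assert(subject != NULL && ovector != NULL && buf != NULL && n >= 0);
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;
    len = end - start;              /* a data-unit count, not a character count */
    if ((size_t)len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

With the 16-bit or 32-bit library the same arithmetic applies, but the offsets index 16-bit or 32-bit data units, so the subject pointer and buffer would need the corresponding element type.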
@@ -2839,7 +2850,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC26" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 26 April 2013
+Last updated: 12 May 2013
<br>
Copyright &copy; 1997-2013 University of Cambridge.
<br>
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 0f50aee..9f5e3dd 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -392,8 +392,10 @@ STRUCTURE TYPES
SUBJECT STRING OFFSETS
- The offsets within subject strings that are returned by the matching
- functions are in 16-bit units rather than bytes.
+ The lengths and starting offsets of subject strings must be specified
+ in 16-bit data units, and the offsets within subject strings that are
+ returned by the matching functions are also in 16-bit units rather than
+ bytes.
NAMED SUBPATTERNS
@@ -506,8 +508,8 @@ AUTHOR
REVISION
- Last updated: 08 November 2012
- Copyright (c) 1997-2012 University of Cambridge.
+ Last updated: 12 May 2013
+ Copyright (c) 1997-2013 University of Cambridge.
------------------------------------------------------------------------------
@@ -722,8 +724,10 @@ STRUCTURE TYPES
SUBJECT STRING OFFSETS
- The offsets within subject strings that are returned by the matching
- functions are in 32-bit units rather than bytes.
+ The lengths and starting offsets of subject strings must be specified
+ in 32-bit data units, and the offsets within subject strings that are
+ returned by the matching functions are also in 32-bit units rather than
+ bytes.
NAMED SUBPATTERNS
@@ -833,8 +837,8 @@ AUTHOR
REVISION
- Last updated: 08 November 2012
- Copyright (c) 1997-2012 University of Cambridge.
+ Last updated: 12 May 2013
+ Copyright (c) 1997-2013 University of Cambridge.
------------------------------------------------------------------------------
@@ -1668,68 +1672,67 @@ PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
ues.
References to bytes and UTF-8 in this document should be read as refer-
- ences to 16-bit data quantities and UTF-16 when using the 16-bit
- library, or 32-bit data quantities and UTF-32 when using the 32-bit
- library, unless specified otherwise. More details of the specific dif-
- ferences for the 16-bit and 32-bit libraries are given in the pcre16
- and pcre32 pages.
+ ences to 16-bit data units and UTF-16 when using the 16-bit library, or
+ 32-bit data units and UTF-32 when using the 32-bit library, unless
+ specified otherwise. More details of the specific differences for the
+ 16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
PCRE API OVERVIEW
PCRE has its own native API, which is described in this document. There
- are also some wrapper functions (for the 8-bit library only) that cor-
- respond to the POSIX regular expression API, but they do not give
- access to all the functionality. They are described in the pcreposix
- documentation. Both of these APIs define a set of C function calls. A
+ are also some wrapper functions (for the 8-bit library only) that cor-
+ respond to the POSIX regular expression API, but they do not give
+ access to all the functionality. They are described in the pcreposix
+ documentation. Both of these APIs define a set of C function calls. A
C++ wrapper (again for the 8-bit library only) is also distributed with
PCRE. It is documented in the pcrecpp page.
- The native API C function prototypes are defined in the header file
- pcre.h, and on Unix-like systems the (8-bit) library itself is called
- libpcre. It can normally be accessed by adding -lpcre to the command
- for linking an application that uses PCRE. The header file defines the
+ The native API C function prototypes are defined in the header file
+ pcre.h, and on Unix-like systems the (8-bit) library itself is called
+ libpcre. It can normally be accessed by adding -lpcre to the command
+ for linking an application that uses PCRE. The header file defines the
macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
- numbers for the library. Applications can use these to include support
+ numbers for the library. Applications can use these to include support
for different releases of PCRE.
In a Windows environment, if you want to statically link an application
- program against a non-dll pcre.a file, you must define PCRE_STATIC
- before including pcre.h or pcrecpp.h, because otherwise the pcre_mal-
+ program against a non-dll pcre.a file, you must define PCRE_STATIC
+ before including pcre.h or pcrecpp.h, because otherwise the pcre_mal-
loc() and pcre_free() exported functions will be declared
__declspec(dllimport), with unwanted results.
- The functions pcre_compile(), pcre_compile2(), pcre_study(), and
- pcre_exec() are used for compiling and matching regular expressions in
- a Perl-compatible manner. A sample program that demonstrates the sim-
- plest way of using them is provided in the file called pcredemo.c in
+ The functions pcre_compile(), pcre_compile2(), pcre_study(), and
+ pcre_exec() are used for compiling and matching regular expressions in
+ a Perl-compatible manner. A sample program that demonstrates the sim-
+ plest way of using them is provided in the file called pcredemo.c in
the PCRE source distribution. A listing of this program is given in the
- pcredemo documentation, and the pcresample documentation describes how
+ pcredemo documentation, and the pcresample documentation describes how
to compile and run it.
- Just-in-time compiler support is an optional feature of PCRE that can
+ Just-in-time compiler support is an optional feature of PCRE that can
be built in appropriate hardware environments. It greatly speeds up the
- matching performance of many patterns. Simple programs can easily
- request that it be used if available, by setting an option that is
- ignored when it is not relevant. More complicated programs might need
- to make use of the functions pcre_jit_stack_alloc(),
- pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control
+ matching performance of many patterns. Simple programs can easily
+ request that it be used if available, by setting an option that is
+ ignored when it is not relevant. More complicated programs might need
+ to make use of the functions pcre_jit_stack_alloc(),
+ pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control
the JIT code's memory usage.
- From release 8.32 there is also a direct interface for JIT execution,
- which gives improved performance. The JIT-specific functions are dis-
+ From release 8.32 there is also a direct interface for JIT execution,
+ which gives improved performance. The JIT-specific functions are dis-
cussed in the pcrejit documentation.
A second matching function, pcre_dfa_exec(), which is not Perl-compati-
- ble, is also provided. This uses a different algorithm for the match-
- ing. The alternative algorithm finds all possible matches (at a given
- point in the subject), and scans the subject just once (unless there
- are lookbehind assertions). However, this algorithm does not return
- captured substrings. A description of the two matching algorithms and
- their advantages and disadvantages is given in the pcrematching docu-
+ ble, is also provided. This uses a different algorithm for the match-
+ ing. The alternative algorithm finds all possible matches (at a given
+ point in the subject), and scans the subject just once (unless there
+ are lookbehind assertions). However, this algorithm does not return
+ captured substrings. A description of the two matching algorithms and
+ their advantages and disadvantages is given in the pcrematching docu-
mentation.
- In addition to the main compiling and matching functions, there are
+ In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that is matched by pcre_exec(). They are:
@@ -1744,105 +1747,105 @@ PCRE API OVERVIEW
pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings.
- The function pcre_maketables() is used to build a set of character
- tables in the current locale for passing to pcre_compile(),
- pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
- provided for specialist use. Most commonly, no special tables are
- passed, in which case internal tables that are generated when PCRE is
+ The function pcre_maketables() is used to build a set of character
+ tables in the current locale for passing to pcre_compile(),
+ pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
+ provided for specialist use. Most commonly, no special tables are
+ passed, in which case internal tables that are generated when PCRE is
built are used.
- The function pcre_fullinfo() is used to find out information about a
- compiled pattern. The function pcre_version() returns a pointer to a
+ The function pcre_fullinfo() is used to find out information about a
+ compiled pattern. The function pcre_version() returns a pointer to a
string containing the version of PCRE and its date of release.
- The function pcre_refcount() maintains a reference count in a data
- block containing a compiled pattern. This is provided for the benefit
+ The function pcre_refcount() maintains a reference count in a data
+ block containing a compiled pattern. This is provided for the benefit
of object-oriented applications.
- The global variables pcre_malloc and pcre_free initially contain the
- entry points of the standard malloc() and free() functions, respec-
+ The global variables pcre_malloc and pcre_free initially contain the
+ entry points of the standard malloc() and free() functions, respec-
tively. PCRE calls the memory management functions via these variables,
- so a calling program can replace them if it wishes to intercept the
+ so a calling program can replace them if it wishes to intercept the
calls. This should be done before calling any PCRE functions.
- The global variables pcre_stack_malloc and pcre_stack_free are also
- indirections to memory management functions. These special functions
- are used only when PCRE is compiled to use the heap for remembering
+ The global variables pcre_stack_malloc and pcre_stack_free are also
+ indirections to memory management functions. These special functions
+ are used only when PCRE is compiled to use the heap for remembering
data, instead of recursive function calls, when running the pcre_exec()
- function. See the pcrebuild documentation for details of how to do
- this. It is a non-standard way of building PCRE, for use in environ-
- ments that have limited stacks. Because of the greater use of memory
- management, it runs more slowly. Separate functions are provided so
- that special-purpose external code can be used for this case. When
- used, these functions are always called in a stack-like manner (last
- obtained, first freed), and always for memory blocks of the same size.
- There is a discussion about PCRE's stack usage in the pcrestack docu-
+ function. See the pcrebuild documentation for details of how to do
+ this. It is a non-standard way of building PCRE, for use in environ-
+ ments that have limited stacks. Because of the greater use of memory
+ management, it runs more slowly. Separate functions are provided so
+ that special-purpose external code can be used for this case. When
+ used, these functions are always called in a stack-like manner (last
+ obtained, first freed), and always for memory blocks of the same size.
+ There is a discussion about PCRE's stack usage in the pcrestack docu-
mentation.
The global variable pcre_callout initially contains NULL. It can be set
- by the caller to a "callout" function, which PCRE will then call at
- specified points during a matching operation. Details are given in the
+ by the caller to a "callout" function, which PCRE will then call at
+ specified points during a matching operation. Details are given in the
pcrecallout documentation.
NEWLINES
- PCRE supports five different conventions for indicating line breaks in
- strings: a single CR (carriage return) character, a single LF (line-
+ PCRE supports five different conventions for indicating line breaks in
+ strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
- ceding, or any Unicode newline sequence. The Unicode newline sequences
- are the three just mentioned, plus the single characters VT (vertical
+ ceding, or any Unicode newline sequence. The Unicode newline sequences
+ are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
- Each of the first three conventions is used by at least one operating
- system as its standard newline sequence. When PCRE is built, a default
- can be specified. The default default is LF, which is the Unix stan-
- dard. When PCRE is run, the default can be overridden, either when a
+ Each of the first three conventions is used by at least one operating
+ system as its standard newline sequence. When PCRE is built, a default
+ can be specified. The default default is LF, which is the Unix stan-
+ dard. When PCRE is run, the default can be overridden, either when a
pattern is compiled, or when it is matched.
At compile time, the newline convention can be specified by the options
- argument of pcre_compile(), or it can be specified by special text at
+ argument of pcre_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
the pcrepattern page for details of the special character sequences.
In the PCRE documentation the word "newline" is used to mean "the char-
- acter or pair of characters that indicate a line break". The choice of
- newline convention affects the handling of the dot, circumflex, and
+ acter or pair of characters that indicate a line break". The choice of
+ newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
- CRLF is a recognized line ending sequence, the match position advance-
+ CRLF is a recognized line ending sequence, the match position advance-
ment for a non-anchored pattern. There is more detail about this in the
section on pcre_exec() options below.
- The choice of newline convention does not affect the interpretation of
- the \n or \r escape sequences, nor does it affect what \R matches,
+ The choice of newline convention does not affect the interpretation of
+ the \n or \r escape sequences, nor does it affect what \R matches,
which is controlled in a similar way, but by separate options.
MULTITHREADING
- The PCRE functions can be used in multi-threading applications, with
+ The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout function pointed to by pcre_callout, are shared by all threads.
- The compiled form of a regular expression is not altered during match-
+ The compiled form of a regular expression is not altered during match-
ing, so the same compiled pattern can safely be used by several threads
at once.
- If the just-in-time optimization feature is being used, it needs sepa-
- rate memory stack areas for each thread. See the pcrejit documentation
+ If the just-in-time optimization feature is being used, it needs sepa-
+ rate memory stack areas for each thread. See the pcrejit documentation
for more details.
SAVING PRECOMPILED PATTERNS FOR LATER USE
The compiled form of a regular expression can be saved and re-used at a
- later time, possibly by a different program, and even on a host other
- than the one on which it was compiled. Details are given in the
- pcreprecompile documentation, which includes a description of the
- pcre_pattern_to_host_byte_order() function. However, compiling a regu-
- lar expression with one version of PCRE for use with a different ver-
+ later time, possibly by a different program, and even on a host other
+ than the one on which it was compiled. Details are given in the
+ pcreprecompile documentation, which includes a description of the
+ pcre_pattern_to_host_byte_order() function. However, compiling a regu-
+ lar expression with one version of PCRE for use with a different ver-
sion is not guaranteed to work and may cause crashes.
@@ -1850,45 +1853,45 @@ CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
- The function pcre_config() makes it possible for a PCRE client to dis-
+ The function pcre_config() makes it possible for a PCRE client to dis-
cover which optional features have been compiled into the PCRE library.
- The pcrebuild documentation has more details about these optional fea-
+ The pcrebuild documentation has more details about these optional fea-
tures.
- The first argument for pcre_config() is an integer, specifying which
+ The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
- into which the information is placed. The returned value is zero on
- success, or the negative error code PCRE_ERROR_BADOPTION if the value
- in the first argument is not recognized. The following information is
+ into which the information is placed. The returned value is zero on
+ success, or the negative error code PCRE_ERROR_BADOPTION if the value
+ in the first argument is not recognized. The following information is
available:
PCRE_CONFIG_UTF8
- The output is an integer that is set to one if UTF-8 support is avail-
- able; otherwise it is set to zero. This value should normally be given
+ The output is an integer that is set to one if UTF-8 support is avail-
+ able; otherwise it is set to zero. This value should normally be given
to the 8-bit version of this function, pcre_config(). If it is given to
- the 16-bit or 32-bit version of this function, the result is
+ the 16-bit or 32-bit version of this function, the result is
PCRE_ERROR_BADOPTION.
PCRE_CONFIG_UTF16
The output is an integer that is set to one if UTF-16 support is avail-
- able; otherwise it is set to zero. This value should normally be given
+ able; otherwise it is set to zero. This value should normally be given
to the 16-bit version of this function, pcre16_config(). If it is given
- to the 8-bit or 32-bit version of this function, the result is
+ to the 8-bit or 32-bit version of this function, the result is
PCRE_ERROR_BADOPTION.
PCRE_CONFIG_UTF32
The output is an integer that is set to one if UTF-32 support is avail-
- able; otherwise it is set to zero. This value should normally be given
+ able; otherwise it is set to zero. This value should normally be given
to the 32-bit version of this function, pcre32_config(). If it is given
- to the 8-bit or 16-bit version of this function, the result is
+ to the 8-bit or 16-bit version of this function, the result is
PCRE_ERROR_BADOPTION.
PCRE_CONFIG_UNICODE_PROPERTIES
- The output is an integer that is set to one if support for Unicode
+ The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero.
PCRE_CONFIG_JIT
@@ -1898,70 +1901,70 @@ CHECKING BUILD-TIME OPTIONS
PCRE_CONFIG_JITTARGET
- The output is a pointer to a zero-terminated "const char *" string. If
+ The output is a pointer to a zero-terminated "const char *" string. If
JIT support is available, the string contains the name of the architec-
- ture for which the JIT compiler is configured, for example "x86 32bit
- (little endian + unaligned)". If JIT support is not available, the
+ ture for which the JIT compiler is configured, for example "x86 32bit
+ (little endian + unaligned)". If JIT support is not available, the
result is NULL.
PCRE_CONFIG_NEWLINE
- The output is an integer whose value specifies the default character
- sequence that is recognized as meaning "newline". The values that are
+ The output is an integer whose value specifies the default character
+ sequence that is recognized as meaning "newline". The values that are
supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
- for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
- ANYCRLF, and ANY yield the same values. However, the value for LF is
- normally 21, though some EBCDIC environments use 37. The corresponding
- values for CRLF are 3349 and 3365. The default should normally corre-
+ for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
+ ANYCRLF, and ANY yield the same values. However, the value for LF is
+ normally 21, though some EBCDIC environments use 37. The corresponding
+ values for CRLF are 3349 and 3365. The default should normally corre-
spond to the standard sequence for your operating system.
PCRE_CONFIG_BSR
The output is an integer whose value indicates what character sequences
- the \R escape sequence matches by default. A value of 0 means that \R
- matches any Unicode line ending sequence; a value of 1 means that \R
+ the \R escape sequence matches by default. A value of 0 means that \R
+ matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a pat-
tern is compiled or matched.
PCRE_CONFIG_LINK_SIZE
- The output is an integer that contains the number of bytes used for
+ The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. For the 8-bit
library, the value can be 2, 3, or 4. For the 16-bit library, the value
- is either 2 or 4 and is still a number of bytes. For the 32-bit
+ is either 2 or 4 and is still a number of bytes. For the 32-bit
library, the value is either 2 or 4 and is still a number of bytes. The
default value of 2 is sufficient for all but the most massive patterns,
- since it allows the compiled pattern to be up to 64K in size. Larger
- values allow larger regular expressions to be compiled, at the expense
+ since it allows the compiled pattern to be up to 64K in size. Larger
+ values allow larger regular expressions to be compiled, at the expense
of slower matching.
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
- The output is an integer that contains the threshold above which the
- POSIX interface uses malloc() for output vectors. Further details are
+ The output is an integer that contains the threshold above which the
+ POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.
PCRE_CONFIG_MATCH_LIMIT
- The output is a long integer that gives the default limit for the num-
- ber of internal matching function calls in a pcre_exec() execution.
+ The output is a long integer that gives the default limit for the num-
+ ber of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below.
PCRE_CONFIG_MATCH_LIMIT_RECURSION
The output is a long integer that gives the default limit for the depth
- of recursion when calling the internal matching function in a
- pcre_exec() execution. Further details are given with pcre_exec()
+ of recursion when calling the internal matching function in a
+ pcre_exec() execution. Further details are given with pcre_exec()
below.
PCRE_CONFIG_STACKRECURSE
- The output is an integer that is set to one if internal recursion when
+ The output is an integer that is set to one if internal recursion when
running pcre_exec() is implemented by recursive function calls that use
- the stack to remember their state. This is the usual way that PCRE is
+ the stack to remember their state. This is the usual way that PCRE is
compiled. The output is zero if PCRE was compiled to use blocks of data
- on the heap instead of recursive function calls. In this case,
- pcre_stack_malloc and pcre_stack_free are called to manage memory
+ on the heap instead of recursive function calls. In this case,
+ pcre_stack_malloc and pcre_stack_free are called to manage memory
blocks on the heap, thus avoiding the use of the stack.
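
The PCRE_CONFIG_NEWLINE values listed above follow a simple pattern: each two-character CRLF value equals the CR code times 256 plus the LF code (13 * 256 + 10 = 3338, 13 * 256 + 21 = 3349, 13 * 256 + 37 = 3365). The following is a minimal sketch, not part of the PCRE API, of how a client might turn such a value into a readable label. The function name newline_label is hypothetical, and the sketch assumes the value has already been obtained from pcre_config().

```c
/* Hypothetical helper (not a PCRE function): map a value reported for
   PCRE_CONFIG_NEWLINE to a human-readable label. The case constants are
   the ones quoted in the documentation text above. */
const char *newline_label(int value)
{
    switch (value) {
    case -1:             return "ANY";
    case -2:             return "ANYCRLF";
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";                        /* 3338 */
    case 21:             return "LF (EBCDIC)";
    case 37:             return "LF (EBCDIC, alternative)";
    case (13 << 8) | 21: return "CRLF (EBCDIC)";               /* 3349 */
    case (13 << 8) | 37: return "CRLF (EBCDIC, alternative)";  /* 3365 */
    default:             return "unknown";
    }
}
```

A caller would pass the integer that pcre_config() stored for PCRE_CONFIG_NEWLINE; any value not listed in the documentation maps to "unknown" in this sketch.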
@@ -1978,65 +1981,67 @@ COMPILING A PATTERN
Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between
- the two interfaces is that pcre_compile2() has an additional argument,
- errorcodeptr, via which a numerical error code can be returned. To
- avoid too much repetition, we refer just to pcre_compile() below, but
+ the two interfaces is that pcre_compile2() has an additional argument,
+ errorcodeptr, via which a numerical error code can be returned. To
+ avoid too much repetition, we refer just to pcre_compile() below, but
the information applies equally to pcre_compile2().
The pattern is a C string terminated by a binary zero, and is passed in
- the pattern argument. A pointer to a single block of memory that is
- obtained via pcre_malloc is returned. This contains the compiled code
+ the pattern argument. A pointer to a single block of memory that is
+ obtained via pcre_malloc is returned. This contains the compiled code
and related data. The pcre type is defined for the returned block; this
is a typedef for a structure whose contents are not externally defined.
It is up to the caller to free the memory (via pcre_free) when it is no
longer required.
- Although the compiled code of a PCRE regex is relocatable, that is, it
+ Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
- fully relocatable, because it may contain a copy of the tableptr argu-
+ fully relocatable, because it may contain a copy of the tableptr argu-
ment, which is an address (see below).
The options argument contains various bit settings that affect the com-
- pilation. It should be zero if no options are required. The available
- options are described below. Some of them (in particular, those that
- are compatible with Perl, but some others as well) can also be set and
- unset from within the pattern (see the detailed description in the
- pcrepattern documentation). For those options that can be different in
- different parts of the pattern, the contents of the options argument
+ pilation. It should be zero if no options are required. The available
+ options are described below. Some of them (in particular, those that
+ are compatible with Perl, but some others as well) can also be set and
+ unset from within the pattern (see the detailed description in the
+ pcrepattern documentation). For those options that can be different in
+ different parts of the pattern, the contents of the options argument
specify their settings at the start of compilation and execution. The
- PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
- PCRE_NO_START_OPTIMIZE options can be set at the time of matching as
+ PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
+ PCRE_NO_START_OPTIMIZE options can be set at the time of matching as
well as at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
- if compilation of a pattern fails, pcre_compile() returns NULL, and
+ if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error mes-
sage. This is a static string that is part of the library. You must not
- try to free it. Normally, the offset from the start of the pattern to
- the byte that was being processed when the error was discovered is
- placed in the variable pointed to by erroffset, which must not be NULL
- (if it is, an immediate error is given). However, for an invalid UTF-8
- string, the offset is that of the first byte of the failing character.
+ try to free it. Normally, the offset from the start of the pattern to
+ the data unit that was being processed when the error was discovered is
+ placed in the variable pointed to by erroffset, which must not be NULL
+ (if it is, an immediate error is given). However, for an invalid UTF-8
+ or UTF-16 string, the offset is that of the first data unit of the
+ failing character.
Some errors are not detected until the whole pattern has been scanned;
in these cases, the offset passed back is the length of the pattern.
- Note that the offset is in bytes, not characters, even in UTF-8 mode.
- It may sometimes point into the middle of a UTF-8 character.
+ Note that the offset is in data units, not characters, even in a UTF
+ mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
+ acter.
- If pcre_compile2() is used instead of pcre_compile(), and the error-
- codeptr argument is not NULL, a non-zero error code number is returned
- via this argument in the event of an error. This is in addition to the
+ If pcre_compile2() is used instead of pcre_compile(), and the error-
+ codeptr argument is not NULL, a non-zero error code number is returned
+ via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.
- If the final argument, tableptr, is NULL, PCRE uses a default set of
- character tables that are built when PCRE is compiled, using the
- default C locale. Otherwise, tableptr must be an address that is the
- result of a call to pcre_maketables(). This value is stored with the
- compiled pattern, and used again by pcre_exec(), unless another table
+ If the final argument, tableptr, is NULL, PCRE uses a default set of
+ character tables that are built when PCRE is compiled, using the
+ default C locale. Otherwise, tableptr must be an address that is the
+ result of a call to pcre_maketables(). This value is stored with the
+ compiled pattern, and used again by pcre_exec(), unless another table
pointer is passed to it. For more discussion, see the section on locale
support below.
- This code fragment shows a typical straightforward call to pcre_com-
+ This code fragment shows a typical straightforward call to pcre_com-
pile():
pcre *re;
@@ -2049,154 +2054,157 @@ COMPILING A PATTERN
&erroffset, /* for error offset */
NULL); /* use default character tables */
- The following names for option bits are defined in the pcre.h header
+ The following names for option bits are defined in the pcre.h header
file:
PCRE_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it
- is constrained to match only at the first matching point in the string
- that is being searched (the "subject string"). This effect can also be
- achieved by appropriate constructs in the pattern itself, which is the
+ is constrained to match only at the first matching point in the string
+ that is being searched (the "subject string"). This effect can also be
+ achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.
PCRE_AUTO_CALLOUT
If this bit is set, pcre_compile() automatically inserts callout items,
- all with number 255, before each pattern item. For discussion of the
+ all with number 255, before each pattern item. For discussion of the
callout facility, see the pcrecallout documentation.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
- sequence matches. The choice is either to match only CR, LF, or CRLF,
+ sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. The default is specified when
PCRE is built. It can be overridden from within the pattern, or by set-
ting an option when a compiled pattern is matched.
PCRE_CASELESS
- If this bit is set, letters in the pattern match both upper and lower
- case letters. It is equivalent to Perl's /i option, and it can be
- changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
- always understands the concept of case for characters whose values are
- less than 128, so caseless matching is always possible. For characters
- with higher values, the concept of case is supported if PCRE is com-
- piled with Unicode property support, but not otherwise. If you want to
- use caseless matching for characters 128 and above, you must ensure
- that PCRE is compiled with Unicode property support as well as with
+ If this bit is set, letters in the pattern match both upper and lower
+ case letters. It is equivalent to Perl's /i option, and it can be
+ changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
+ always understands the concept of case for characters whose values are
+ less than 128, so caseless matching is always possible. For characters
+ with higher values, the concept of case is supported if PCRE is com-
+ piled with Unicode property support, but not otherwise. If you want to
+ use caseless matching for characters 128 and above, you must ensure
+ that PCRE is compiled with Unicode property support as well as with
UTF-8 support.
PCRE_DOLLAR_ENDONLY
- If this bit is set, a dollar metacharacter in the pattern matches only
- at the end of the subject string. Without this option, a dollar also
- matches immediately before a newline at the end of the string (but not
- before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
- if PCRE_MULTILINE is set. There is no equivalent to this option in
+ If this bit is set, a dollar metacharacter in the pattern matches only
+ at the end of the subject string. Without this option, a dollar also
+ matches immediately before a newline at the end of the string (but not
+ before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
+ if PCRE_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.
PCRE_DOTALL
- If this bit is set, a dot metacharacter in the pattern matches a char-
+ If this bit is set, a dot metacharacter in the pattern matches a char-
acter of any value, including one that indicates a newline. However, it
- only ever matches one character, even if newlines are coded as CRLF.
- Without this option, a dot does not match when the current position is
+ only ever matches one character, even if newlines are coded as CRLF.
+ Without this option, a dot does not match when the current position is
at a newline. This option is equivalent to Perl's /s option, and it can
- be changed within a pattern by a (?s) option setting. A negative class
+ be changed within a pattern by a (?s) option setting. A negative class
such as [^a] always matches newline characters, independent of the set-
ting of this option.
PCRE_DUPNAMES
- If this bit is set, names used to identify capturing subpatterns need
+ If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
- is known that only one instance of the named subpattern can ever be
- matched. There are more details of named subpatterns below; see also
+ is known that only one instance of the named subpattern can ever be
+ matched. There are more details of named subpatterns below; see also
the pcrepattern documentation.
PCRE_EXTENDED
- If this bit is set, white space data characters in the pattern are
- totally ignored except when escaped or inside a character class. White
+ If this bit is set, white space data characters in the pattern are
+ totally ignored except when escaped or inside a character class. White
space does not include the VT character (code 11). In addition, charac-
ters between an unescaped # outside a character class and the next new-
- line, inclusive, are also ignored. This is equivalent to Perl's /x
- option, and it can be changed within a pattern by a (?x) option set-
+ line, inclusive, are also ignored. This is equivalent to Perl's /x
+ option, and it can be changed within a pattern by a (?x) option set-
ting.
- Which characters are interpreted as newlines is controlled by the
- options passed to pcre_compile() or by a special sequence at the start
- of the pattern, as described in the section entitled "Newline conven-
+ Which characters are interpreted as newlines is controlled by the
+ options passed to pcre_compile() or by a special sequence at the start
+ of the pattern, as described in the section entitled "Newline conven-
tions" in the pcrepattern documentation. Note that the end of this type
- of comment is a literal newline sequence in the pattern; escape
+ of comment is a literal newline sequence in the pattern; escape
sequences that happen to represent a newline do not count.
- This option makes it possible to include comments inside complicated
- patterns. Note, however, that this applies only to data characters.
- White space characters may never appear within special character
+ This option makes it possible to include comments inside complicated
+ patterns. Note, however, that this applies only to data characters.
+ White space characters may never appear within special character
sequences in a pattern, for example within the sequence (?( that intro-
duces a conditional subpattern.
PCRE_EXTRA
- This option was invented in order to turn on additional functionality
- of PCRE that is incompatible with Perl, but it is currently of very
- little use. When set, any backslash in a pattern that is followed by a
- letter that has no special meaning causes an error, thus reserving
- these combinations for future expansion. By default, as in Perl, a
- backslash followed by a letter with no special meaning is treated as a
+ This option was invented in order to turn on additional functionality
+ of PCRE that is incompatible with Perl, but it is currently of very
+ little use. When set, any backslash in a pattern that is followed by a
+ letter that has no special meaning causes an error, thus reserving
+ these combinations for future expansion. By default, as in Perl, a
+ backslash followed by a letter with no special meaning is treated as a
literal. (Perl can, however, be persuaded to give an error for this, by
- running it with the -w option.) There are at present no other features
- controlled by this option. It can also be set by a (?X) option setting
+ running it with the -w option.) There are at present no other features
+ controlled by this option. It can also be set by a (?X) option setting
within a pattern.
PCRE_FIRSTLINE
- If this option is set, an unanchored pattern is required to match
- before or at the first newline in the subject string, though the
+ If this option is set, an unanchored pattern is required to match
+ before or at the first newline in the subject string, though the
matched text may continue over the newline.
PCRE_JAVASCRIPT_COMPAT
If this option is set, PCRE's behaviour is changed in some ways so that
- it is compatible with JavaScript rather than Perl. The changes are as
+ it is compatible with JavaScript rather than Perl. The changes are as
follows:
- (1) A lone closing square bracket in a pattern causes a compile-time
- error, because this is illegal in JavaScript (by default it is treated
+ (1) A lone closing square bracket in a pattern causes a compile-time
+ error, because this is illegal in JavaScript (by default it is treated
as a data character). Thus, the pattern AB]CD becomes illegal when this
option is set.
- (2) At run time, a back reference to an unset subpattern group matches
- an empty string (by default this causes the current matching alterna-
- tive to fail). A pattern such as (\1)(a) succeeds when this option is
- set (assuming it can find an "a" in the subject), whereas it fails by
+ (2) At run time, a back reference to an unset subpattern group matches
+ an empty string (by default this causes the current matching alterna-
+ tive to fail). A pattern such as (\1)(a) succeeds when this option is
+ set (assuming it can find an "a" in the subject), whereas it fails by
default, for Perl compatibility.
(3) \U matches an upper case "U" character; by default \U causes a com-
pile time error (Perl uses \U to upper case subsequent characters).
(4) \u matches a lower case "u" character unless it is followed by four
- hexadecimal digits, in which case the hexadecimal number defines the
- code point to match. By default, \u causes a compile time error (Perl
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, \u causes a compile time error (Perl
uses it to upper case the following character).
- (5) \x matches a lower case "x" character unless it is followed by two
- hexadecimal digits, in which case the hexadecimal number defines the
- code point to match. By default, as in Perl, a hexadecimal number is
+ (5) \x matches a lower case "x" character unless it is followed by two
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, as in Perl, a hexadecimal number is
always expected after \x, but it may have zero, one, or two digits (so,
for example, \xz matches a binary zero character followed by z).
PCRE_MULTILINE
- By default, PCRE treats the subject string as consisting of a single
- line of characters (even if it actually contains newlines). The "start
- of line" metacharacter (^) matches only at the start of the string,
- while the "end of line" metacharacter ($) matches only at the end of
- the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
- is set). This is the same as Perl.
+ By default, for the purposes of matching "start of line" and "end of
+ line", PCRE treats the subject string as consisting of a single line of
+ characters, even if it actually contains newlines. The "start of line"
+ metacharacter (^) matches only at the start of the string, and the "end
+ of line" metacharacter ($) matches only at the end of the string, or
+ before a terminating newline (except when PCRE_DOLLAR_ENDONLY is set).
+ Note, however, that unless PCRE_DOTALL is set, the "any character"
+ metacharacter (.) does not match at a newline. This behaviour (for ^,
+ $, and dot) is the same as Perl.
When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal
@@ -2736,22 +2744,22 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_MAXLOOKBEHIND
- Return the number of characters (NB not bytes) in the longest lookbe-
- hind assertion in the pattern. This information is useful when doing
- multi-segment matching using the partial matching facilities. Note that
- the simple assertions \b and \B require a one-character lookbehind. \A
- also registers a one-character lookbehind, though it does not actually
- inspect the previous character. This is to ensure that at least one
- character from the old segment is retained when a new segment is pro-
- cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
- match incorrectly at the start of a new segment.
+ Return the number of characters (NB not data units) in the longest
+ lookbehind assertion in the pattern. This information is useful when
+ doing multi-segment matching using the partial matching facilities.
+ Note that the simple assertions \b and \B require a one-character look-
+ behind. \A also registers a one-character lookbehind, though it does
+ not actually inspect the previous character. This is to ensure that at
+ least one character from the old segment is retained when a new segment
+ is processed. Otherwise, if there are no lookbehinds in the pattern, \A
+ might match incorrectly at the start of a new segment.
PCRE_INFO_MINLENGTH
If the pattern was studied and a minimum length for matching subject
strings was computed, its value is returned. Otherwise the returned
- value is -1. The value is a number of characters, which in UTF-8 mode
- may be different from the number of bytes. The fourth argument should
+ value is -1. The value is a number of characters, which in UTF mode may
+ be different from the number of data units. The fourth argument should
point to an int variable. A non-negative value is a lower bound to the
length of any matching string. There may not be any strings of that
length that do actually match, but every string that does match is at
@@ -2779,7 +2787,7 @@ INFORMATION ABOUT A PATTERN
the 8-bit library, where the first two bytes of each entry are the num-
ber of the capturing parenthesis, most significant byte first. In the
16-bit library, the pointer points to 16-bit data units, the first of
- which contains the parenthesis number. In the 32-bit library, the
+ which contains the parenthesis number. In the 32-bit library, the
pointer points to 32-bit data units, the first of which contains the
parenthesis number. The rest of the entry is the corresponding name,
zero terminated.
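
The 8-bit entry layout described above (two bytes of parenthesis number, most significant byte first, followed by the zero-terminated name) can be decoded as in the following sketch. The struct name_entry type and the decode_name_entry function are hypothetical names introduced only for illustration; the sketch assumes the caller already holds a pointer to one entry of the name table, as obtained via pcre_fullinfo().

```c
/* Hypothetical decoded view of one name-table entry (8-bit library). */
struct name_entry {
    unsigned int group_number;  /* number of the capturing parenthesis */
    const char *name;           /* zero-terminated name inside the table */
};

/* Decode one entry: the first two bytes hold the group number, most
   significant byte first; the name starts at the third byte. The returned
   name pointer refers to the table's own memory and is not copied. */
struct name_entry decode_name_entry(const unsigned char *entry)
{
    struct name_entry e;
    e.group_number = ((unsigned int)entry[0] << 8) | entry[1];
    e.name = (const char *)(entry + 2);
    return e;
}
```

For the 16-bit and 32-bit libraries the text above says the number occupies a single data unit, so the byte-combining step would not apply there.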
@@ -2857,26 +2865,26 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_SIZE
- Return the size of the compiled pattern in bytes (for both libraries).
- The fourth argument should point to a size_t variable. This value does
- not include the size of the pcre structure that is returned by
- pcre_compile(). The value that is passed as the argument to pcre_mal-
- loc() when pcre_compile() is getting memory in which to place the com-
- piled data is the value returned by this option plus the size of the
- pcre structure. Studying a compiled pattern, with or without JIT, does
- not alter the value returned by this option.
+ Return the size of the compiled pattern in bytes (for all three
+ libraries). The fourth argument should point to a size_t variable. This
+ value does not include the size of the pcre structure that is returned
+ by pcre_compile(). The value that is passed as the argument to
+ pcre_malloc() when pcre_compile() is getting memory in which to place
+ the compiled data is the value returned by this option plus the size of
+ the pcre structure. Studying a compiled pattern, with or without JIT,
+ does not alter the value returned by this option.
PCRE_INFO_STUDYSIZE
- Return the size in bytes of the data block pointed to by the study_data
- field in a pcre_extra block. If pcre_extra is NULL, or there is no
- study data, zero is returned. The fourth argument should point to a
- size_t variable. The study_data field is set by pcre_study() to record
- information that will speed up matching (see the section entitled
- "Studying a pattern" above). The format of the study_data block is pri-
- vate, but its length is made available via this option so that it can
- be saved and restored (see the pcreprecompile documentation for
- details).
+ Return the size in bytes (for all three libraries) of the data block
+ pointed to by the study_data field in a pcre_extra block. If pcre_extra
+ is NULL, or there is no study data, zero is returned. The fourth argu-
+ ment should point to a size_t variable. The study_data field is set by
+ pcre_study() to record information that will speed up matching (see the
+ section entitled "Studying a pattern" above). The format of the
+ study_data block is private, but its length is made available via this
+ option so that it can be saved and restored (see the pcreprecompile
+ documentation for details).
PCRE_INFO_FIRSTCHARACTERFLAGS
@@ -3359,149 +3367,156 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
The string to be matched by pcre_exec()
The subject string is passed to pcre_exec() as a pointer in subject, a
- length in bytes in length, and a starting byte offset in startoffset.
- If this is negative or greater than the length of the subject,
+ length in length, and a starting offset in startoffset. The units for
+ length and startoffset are bytes for the 8-bit library, 16-bit data
+ items for the 16-bit library, and 32-bit data items for the 32-bit
+ library.
+
+ If startoffset is negative or greater than the length of the subject,
pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
zero, the search for a match starts at the beginning of the subject,
- and this is by far the most common case. In UTF-8 mode, the byte offset
- must point to the start of a UTF-8 character (or the end of the sub-
- ject). Unlike the pattern string, the subject may contain binary zero
- bytes.
-
- A non-zero starting offset is useful when searching for another match
- in the same subject by calling pcre_exec() again after a previous suc-
- cess. Setting startoffset differs from just passing over a shortened
- string and setting PCRE_NOTBOL in the case of a pattern that begins
+ and this is by far the most common case. In UTF-8 or UTF-16 mode, the
+ offset must point to the start of a character, or the end of the sub-
+ ject (in UTF-32 mode, one data unit equals one character, so all off-
+ sets are valid). Unlike the pattern string, the subject may contain
+ binary zeroes.
+
+ A non-zero starting offset is useful when searching for another match
+ in the same subject by calling pcre_exec() again after a previous suc-
+ cess. Setting startoffset differs from just passing over a shortened
+ string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern
\Biss\B
- which finds occurrences of "iss" in the middle of words. (\B matches
- only if the current position in the subject is not a word boundary.)
- When applied to the string "Mississipi" the first call to pcre_exec()
- finds the first occurrence. If pcre_exec() is called again with just
- the remainder of the subject, namely "issipi", it does not match,
+ which finds occurrences of "iss" in the middle of words. (\B matches
+ only if the current position in the subject is not a word boundary.)
+ When applied to the string "Mississippi" the first call to pcre_exec()
+ finds the first occurrence. If pcre_exec() is called again with just
+ the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
- to be a word boundary. However, if pcre_exec() is passed the entire
+ to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second occur-
- rence of "iss" because it is able to look behind the starting point to
+ rence of "iss" because it is able to look behind the starting point to
discover that it is preceded by a letter.
- Finding all the matches in a subject is tricky when the pattern can
+ Finding all the matches in a subject is tricky when the pattern can
match an empty string. It is possible to emulate Perl's /g behaviour by
- first trying the match again at the same offset, with the
- PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
- fails, advancing the starting offset and trying an ordinary match
+ first trying the match again at the same offset, with the
+ PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
+ fails, advancing the starting offset and trying an ordinary match
again. There is some code that demonstrates how to do this in the pcre-
demo sample program. In the most general case, you have to check to see
- if the newline convention recognizes CRLF as a newline, and if so, and
+ if the newline convention recognizes CRLF as a newline, and if so, and
the current character is CR followed by LF, advance the starting offset
by two characters instead of one.
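The advance step described above can be sketched as follows. This is a minimal illustration, not the code from pcredemo: the helper name next_start_offset and its crlf_is_newline flag are hypothetical, and the sketch assumes the 8-bit library in non-UTF mode, where one character is one byte.

```c
/* Hypothetical helper (not part of PCRE): after an empty match at
   `offset` could not be retried successfully, return the offset at which
   the next ordinary match attempt should start, or -1 if the subject is
   exhausted. Assumes non-UTF 8-bit subjects (1 character = 1 byte). */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline)
{
    if (offset >= length)
        return -1;                      /* nothing left to search */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;              /* step over CRLF as one unit */
    return offset + 1;                  /* step over a single character */
}
```

In UTF-8 mode the single-character step would also have to skip any continuation bytes, which this sketch deliberately leaves out.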
- If a non-zero starting offset is passed when the pattern is anchored,
+ If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
- if the pattern does not require the match to be at the start of the
+ if the pattern does not require the match to be at the start of the
subject.
How pcre_exec() returns captured substrings
- In general, a pattern matches a certain portion of the subject, and in
- addition, further substrings from the subject may be picked out by
- parts of the pattern. Following the usage in Jeffrey Friedl's book,
- this is called "capturing" in what follows, and the phrase "capturing
- subpattern" is used for a fragment of a pattern that picks out a sub-
- string. PCRE supports several other kinds of parenthesized subpattern
+ In general, a pattern matches a certain portion of the subject, and in
+ addition, further substrings from the subject may be picked out by
+ parts of the pattern. Following the usage in Jeffrey Friedl's book,
+ this is called "capturing" in what follows, and the phrase "capturing
+ subpattern" is used for a fragment of a pattern that picks out a sub-
+ string. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured.
Captured substrings are returned to the caller via a vector of integers
- whose address is passed in ovector. The number of elements in the vec-
- tor is passed in ovecsize, which must be a non-negative number. Note:
+ whose address is passed in ovector. The number of elements in the vec-
+ tor is passed in ovecsize, which must be a non-negative number. Note:
this argument is NOT the size of ovector in bytes.
- The first two-thirds of the vector is used to pass back captured sub-
- strings, each substring using a pair of integers. The remaining third
- of the vector is used as workspace by pcre_exec() while matching cap-
- turing subpatterns, and is not available for passing back information.
- The number passed in ovecsize should always be a multiple of three. If
+ The first two-thirds of the vector is used to pass back captured sub-
+ strings, each substring using a pair of integers. The remaining third
+ of the vector is used as workspace by pcre_exec() while matching cap-
+ turing subpatterns, and is not available for passing back information.
+ The number passed in ovecsize should always be a multiple of three. If
it is not, it is rounded down.
- When a match is successful, information about captured substrings is
- returned in pairs of integers, starting at the beginning of ovector,
- and continuing up to two-thirds of its length at the most. The first
- element of each pair is set to the byte offset of the first character
- in a substring, and the second is set to the byte offset of the first
- character after the end of a substring. Note: these values are always
- byte offsets, even in UTF-8 mode. They are not character counts.
-
- The first pair of integers, ovector[0] and ovector[1], identify the
- portion of the subject string matched by the entire pattern. The next
- pair is used for the first capturing subpattern, and so on. The value
+ When a match is successful, information about captured substrings is
+ returned in pairs of integers, starting at the beginning of ovector,
+ and continuing up to two-thirds of its length at the most. The first
+ element of each pair is set to the offset of the first character in a
+ substring, and the second is set to the offset of the first character
+ after the end of a substring. These values are always data unit off-
+ sets, even in UTF mode. They are byte offsets in the 8-bit library,
+ 16-bit data item offsets in the 16-bit library, and 32-bit data item
+ offsets in the 32-bit library. Note: they are not character counts.
+
+ The first pair of integers, ovector[0] and ovector[1], identify the
+ portion of the subject string matched by the entire pattern. The next
+ pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
- has been set. For example, if two substrings have been captured, the
- returned value is 3. If there are no capturing subpatterns, the return
+ has been set. For example, if two substrings have been captured, the
+ returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.
If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.
- If the vector is too small to hold all the captured substring offsets,
+ If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
- function returns a value of zero. If neither the actual string matched
- nor any captured substrings are of interest, pcre_exec() may be called
- with ovector passed as NULL and ovecsize as zero. However, if the pat-
- tern contains back references and the ovector is not big enough to
- remember the related substrings, PCRE has to get additional memory for
- use during matching. Thus it is usually advisable to supply an ovector
+ function returns a value of zero. If neither the actual string matched
+ nor any captured substrings are of interest, pcre_exec() may be called
+ with ovector passed as NULL and ovecsize as zero. However, if the pat-
+ tern contains back references and the ovector is not big enough to
+ remember the related substrings, PCRE has to get additional memory for
+ use during matching. Thus it is usually advisable to supply an ovector
of reasonable size.
- There are some cases where zero is returned (indicating vector over-
- flow) when in fact the vector is exactly the right size for the final
+ There are some cases where zero is returned (indicating vector over-
+ flow) when in fact the vector is exactly the right size for the final
match. For example, consider the pattern
(a)(?:(b)c|bd)
- If a vector of 6 elements (allowing for only 1 captured substring) is
+ If a vector of 6 elements (allowing for only 1 captured substring) is
given with subject string "abd", pcre_exec() will try to set the second
captured string, thereby recording a vector overflow, before failing to
- match "c" and backing up to try the second alternative. The zero
- return, however, does correctly indicate that the maximum number of
+ match "c" and backing up to try the second alternative. The zero
+ return, however, does correctly indicate that the maximum number of
slots (namely 2) have been filled. In similar cases where there is tem-
- porary overflow, but the final number of used slots is actually less
+ porary overflow, but the final number of used slots is actually less
than the maximum, a non-zero value is returned.
The pcre_fullinfo() function can be used to find out how many capturing
- subpatterns there are in a compiled pattern. The smallest size for
- ovector that will allow for n captured substrings, in addition to the
+ subpatterns there are in a compiled pattern. The smallest size for
+ ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
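The sizing rule can be written down as a small sketch. The helper names ovector_size_for and usable_pairs are hypothetical, not PCRE functions; n is assumed to be the capture count obtained from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
/* Hypothetical helper: smallest ovecsize that holds the whole-match pair
   plus n captured substrings, including the workspace third. */
int ovector_size_for(int n_captures)
{
    return (n_captures + 1) * 3;
}

/* Hypothetical helper: number of offset pairs that can be passed back for
   a given ovecsize. A size that is not a multiple of three is rounded
   down, so integer division gives the usable pair count. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For a pattern with two capturing subpatterns this gives a vector of 9 integers, of which the first 6 carry offsets.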
- It is possible for capturing subpattern number n+1 to match some part
+ It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
- if the string "abc" is matched against the pattern (a|(z))(bc) the
+ if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
- 2 is not. When this happens, both values in the offset pairs corre-
+ 2 is not. When this happens, both values in the offset pairs corre-
sponding to unused subpatterns are set to -1.
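A caller therefore has to test for -1 before using a pair. The sketch below shows one way to walk the returned pairs; the helper names are hypothetical and the offsets are assumed to come from the 8-bit library, so they are byte offsets.

```c
#include <stdio.h>

/* Hypothetical helper: 1 if pair i was set by the match, 0 if it is unset
   (both offsets are -1 in that case). */
int capture_is_set(const int *ovector, int i)
{
    return ovector[2 * i] >= 0;
}

/* Hypothetical helper: length of capture i in data units; only
   meaningful when capture_is_set() is true. */
int capture_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}

/* Print every pair up to the positive value returned by pcre_exec(). */
void print_captures(const char *subject, const int *ovector, int rc)
{
    for (int i = 0; i < rc; i++) {
        if (!capture_is_set(ovector, i))
            printf("%d: <unset>\n", i);
        else
            printf("%d: %.*s\n", i, capture_length(ovector, i),
                   subject + ovector[2 * i]);
    }
}
```

For "abc" matched against (a|(z))(bc), the pairs would be {0,3}, {0,1}, {-1,-1}, {1,3} with a return value of 4, so pair 2 is reported as unset.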
- Offset values that correspond to unused subpatterns at the end of the
- expression are also set to -1. For example, if the string "abc" is
- matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
- matched. The return from the function is 2, because the highest used
- capturing subpattern number is 1, and the offsets for for the second
- and third capturing subpatterns (assuming the vector is large enough,
+ Offset values that correspond to unused subpatterns at the end of the
+ expression are also set to -1. For example, if the string "abc" is
+ matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
+ matched. The return from the function is 2, because the highest used
+ capturing subpattern number is 1, and the offsets for the second
+ and third capturing subpatterns (assuming the vector is large enough,
of course) are set to -1.
- Note: Elements in the first two-thirds of ovector that do not corre-
- spond to capturing parentheses in the pattern are never changed. That
- is, if a pattern contains n capturing parentheses, no more than ovec-
- tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in
+ Note: Elements in the first two-thirds of ovector that do not corre-
+ spond to capturing parentheses in the pattern are never changed. That
+ is, if a pattern contains n capturing parentheses, no more than ovec-
+ tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in
the first two-thirds) retain whatever values they previously had.
- Some convenience functions are provided for extracting the captured
+ Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.
Error return values from pcre_exec()
- If pcre_exec() fails, it returns a negative number. The following are
+ If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:
PCRE_ERROR_NOMATCH (-1)
@@ -3510,7 +3525,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_NULL (-2)
- Either code or subject was passed as NULL, or ovector was NULL and
+ Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.
PCRE_ERROR_BADOPTION (-3)
@@ -3519,82 +3534,82 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_BADMAGIC (-4)
- PCRE stores a 4-byte "magic number" at the start of the compiled code,
+ PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
- an environment with the other endianness. This is the error that PCRE
+ an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.
PCRE_ERROR_UNKNOWN_OPCODE (-5)
While running the pattern match, an unknown item was encountered in the
- compiled pattern. This error could be caused by a bug in PCRE or by
+ compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.
PCRE_ERROR_NOMEMORY (-6)
- If a pattern contains back references, but the ovector that is passed
+ If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
- PCRE gets a block of memory at the start of matching to use for this
- purpose. If the call via pcre_malloc() fails, this error is given. The
+ PCRE gets a block of memory at the start of matching to use for this
+ purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.
- This error is also given if pcre_stack_malloc() fails in pcre_exec().
- This can happen only when PCRE has been compiled with --disable-stack-
+ This error is also given if pcre_stack_malloc() fails in pcre_exec().
+ This can happen only when PCRE has been compiled with --disable-stack-
for-recursion.
PCRE_ERROR_NOSUBSTRING (-7)
- This error is used by the pcre_copy_substring(), pcre_get_substring(),
+ This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().
PCRE_ERROR_MATCHLIMIT (-8)
- The backtracking limit, as specified by the match_limit field in a
- pcre_extra structure (or defaulted) was reached. See the description
+ The backtracking limit, as specified by the match_limit field in a
+ pcre_extra structure (or defaulted) was reached. See the description
above.
PCRE_ERROR_CALLOUT (-9)
This error is never generated by pcre_exec() itself. It is provided for
- use by callout functions that want to yield a distinctive error code.
+ use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.
PCRE_ERROR_BADUTF8 (-10)
- A string that contains an invalid UTF-8 byte sequence was passed as a
- subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
- the output vector (ovecsize) is at least 2, the byte offset to the
- start of the the invalid UTF-8 character is placed in the first ele-
- ment, and a reason code is placed in the second element. The reason
+ A string that contains an invalid UTF-8 byte sequence was passed as a
+ subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
+ the output vector (ovecsize) is at least 2, the byte offset to the
+ start of the invalid UTF-8 character is placed in the first element,
+ and a reason code is placed in the second element. The reason
codes are listed in the following section. For backward compatibility,
- if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
- acter at the end of the subject (reason codes 1 to 5),
+ if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
+ acter at the end of the subject (reason codes 1 to 5),
PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
PCRE_ERROR_BADUTF8_OFFSET (-11)
- The UTF-8 byte sequence that was passed as a subject was checked and
- found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
- value of startoffset did not point to the beginning of a UTF-8 charac-
+ The UTF-8 byte sequence that was passed as a subject was checked and
+ found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
+ value of startoffset did not point to the beginning of a UTF-8 charac-
ter or the end of the subject.
PCRE_ERROR_PARTIAL (-12)
- The subject string did not match, but it did match partially. See the
+ The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.
PCRE_ERROR_BADPARTIAL (-13)
- This code is no longer in use. It was formerly returned when the
- PCRE_PARTIAL option was used with a compiled pattern containing items
- that were not supported for partial matching. From release 8.00
+ This code is no longer in use. It was formerly returned when the
+ PCRE_PARTIAL option was used with a compiled pattern containing items
+ that were not supported for partial matching. From release 8.00
onwards, there are no restrictions on partial matching.
PCRE_ERROR_INTERNAL (-14)
- An unexpected internal error has occurred. This error could be caused
+ An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.
PCRE_ERROR_BADCOUNT (-15)
@@ -3604,7 +3619,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_RECURSIONLIMIT (-21)
The internal recursion limit, as specified by the match_limit_recursion
- field in a pcre_extra structure (or defaulted) was reached. See the
+ field in a pcre_extra structure (or defaulted) was reached. See the
description above.
PCRE_ERROR_BADNEWLINE (-23)
@@ -3618,29 +3633,29 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_SHORTUTF8 (-25)
- This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
- string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
- option is set. Information about the failure is returned as for
- PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but
- this special error code for PCRE_PARTIAL_HARD precedes the implementa-
- tion of returned information; it is retained for backwards compatibil-
+ This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
+ string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
+ option is set. Information about the failure is returned as for
+ PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but
+ this special error code for PCRE_PARTIAL_HARD precedes the implementa-
+ tion of returned information; it is retained for backwards compatibil-
ity.
PCRE_ERROR_RECURSELOOP (-26)
This error is returned when pcre_exec() detects a recursion loop within
- the pattern. Specifically, it means that either the whole pattern or a
- subpattern has been called recursively for the second time at the same
+ the pattern. Specifically, it means that either the whole pattern or a
+ subpattern has been called recursively for the second time at the same
position in the subject string. Some simple patterns that might do this
- are detected and faulted at compile time, but more complicated cases,
+ are detected and faulted at compile time, but more complicated cases,
in particular mutual recursions between two different subpatterns, can-
not be detected until run time.
PCRE_ERROR_JIT_STACKLIMIT (-27)
- This error is returned when a pattern that was successfully studied
- using a JIT compile option is being matched, but the memory available
- for the just-in-time processing stack is not large enough. See the
+ This error is returned when a pattern that was successfully studied
+ using a JIT compile option is being matched, but the memory available
+ for the just-in-time processing stack is not large enough. See the
pcrejit documentation for more details.
PCRE_ERROR_BADMODE (-28)
@@ -3650,38 +3665,38 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_BADENDIANNESS (-29)
- This error is given if a pattern that was compiled and saved is
- reloaded on a host with different endianness. The utility function
+ This error is given if a pattern that was compiled and saved is
+ reloaded on a host with different endianness. The utility function
pcre_pattern_to_host_byte_order() can be used to convert such a pattern
so that it runs on the new host.
PCRE_ERROR_JIT_BADOPTION (-31)
- This error is returned when a pattern that was successfully studied
- using a JIT compile option is being matched, but the matching mode
- (partial or complete match) does not correspond to any JIT compilation
- mode. When the JIT fast path function is used, this error may be also
- given for invalid options. See the pcrejit documentation for more
+ This error is returned when a pattern that was successfully studied
+ using a JIT compile option is being matched, but the matching mode
+ (partial or complete match) does not correspond to any JIT compilation
+ mode. When the JIT fast path function is used, this error may also be
+ given for invalid options. See the pcrejit documentation for more
details.
PCRE_ERROR_BADLENGTH (-32)
- This error is given if pcre_exec() is called with a negative value for
+ This error is given if pcre_exec() is called with a negative value for
the length argument.
Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
Reason codes for invalid UTF-8 strings
- This section applies only to the 8-bit library. The corresponding
- information for the 16-bit and 32-bit libraries is given in the pcre16
+ This section applies only to the 8-bit library. The corresponding
+ information for the 16-bit and 32-bit libraries is given in the pcre16
and pcre32 pages.
When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
- UTF8, and the size of the output vector (ovecsize) is at least 2, the
- offset of the start of the invalid UTF-8 character is placed in the
+ UTF8, and the size of the output vector (ovecsize) is at least 2, the
+ offset of the start of the invalid UTF-8 character is placed in the
first output vector element (ovector[0]) and a reason code is placed in
- the second element (ovector[1]). The reason codes are given names in
+ the second element (ovector[1]). The reason codes are given names in
the pcre.h header file:
PCRE_UTF8_ERR1
@@ -3690,10 +3705,10 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_UTF8_ERR4
PCRE_UTF8_ERR5
- The string ends with a truncated UTF-8 character; the code specifies
- how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
- characters to be no longer than 4 bytes, the encoding scheme (origi-
- nally defined by RFC 2279) allows for up to 6 bytes, and this is
+ The string ends with a truncated UTF-8 character; the code specifies
+ how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
+ characters to be no longer than 4 bytes, the encoding scheme (origi-
+ nally defined by RFC 2279) allows for up to 6 bytes, and this is
checked first; hence the possibility of 4 or 5 missing bytes.
PCRE_UTF8_ERR6
@@ -3703,24 +3718,24 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_UTF8_ERR10
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
- the character do not have the binary value 0b10 (that is, either the
+ the character do not have the binary value 0b10 (that is, either the
most significant bit is 0, or the next bit is 1).
PCRE_UTF8_ERR11
PCRE_UTF8_ERR12
- A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
+ A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
long; these code points are excluded by RFC 3629.
PCRE_UTF8_ERR13
- A 4-byte character has a value greater than 0x10fff; these code points
+ A 4-byte character has a value greater than 0x10ffff; these code points
are excluded by RFC 3629.
PCRE_UTF8_ERR14
- A 3-byte character has a value in the range 0xd800 to 0xdfff; this
- range of code points are reserved by RFC 3629 for use with UTF-16, and
+ A 3-byte character has a value in the range 0xd800 to 0xdfff; this
+ range of code points is reserved by RFC 3629 for use with UTF-16, and
so are excluded from UTF-8.
PCRE_UTF8_ERR15
@@ -3729,28 +3744,28 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_UTF8_ERR18
PCRE_UTF8_ERR19
- A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
- for a value that can be represented by fewer bytes, which is invalid.
- For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
+ A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
+ for a value that can be represented by fewer bytes, which is invalid.
+ For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
rect coding uses just one byte.
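The arithmetic behind the 0xc0, 0xae example can be checked with a short sketch. The function names are hypothetical; the decoder applies the standard 2-byte UTF-8 layout (110xxxxx 10xxxxxx) and performs no validity checks of its own.

```c
/* Hypothetical helper: decode a 2-byte UTF-8 sequence by taking the low
   5 bits of the first byte and the low 6 bits of the second. */
unsigned decode_2byte(unsigned char b0, unsigned char b1)
{
    return ((b0 & 0x1fu) << 6) | (b1 & 0x3fu);
}

/* Hypothetical helper: a 2-byte sequence is overlong when the decoded
   value is below 0x80 and so would fit in a single byte. */
int is_overlong_2byte(unsigned char b0, unsigned char b1)
{
    return decode_2byte(b0, b1) < 0x80u;
}
```

Here decode_2byte(0xc0, 0xae) yields 0x2e, which is below 0x80 and is therefore overlong.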
PCRE_UTF8_ERR20
The two most significant bits of the first byte of a character have the
- binary value 0b10 (that is, the most significant bit is 1 and the sec-
- ond is 0). Such a byte can only validly occur as the second or subse-
+ binary value 0b10 (that is, the most significant bit is 1 and the sec-
+ ond is 0). Such a byte can only validly occur as the second or subse-
quent byte of a multi-byte character.
PCRE_UTF8_ERR21
- The first byte of a character has the value 0xfe or 0xff. These values
+ The first byte of a character has the value 0xfe or 0xff. These values
can never occur in a valid UTF-8 string.
PCRE_UTF8_ERR22
- This error code was formerly used when the presence of a so-called
- "non-character" caused an error. Unicode corrigendum #9 makes it clear
- that such characters should not cause a string to be rejected, and so
+ This error code was formerly used when the presence of a so-called
+ "non-character" caused an error. Unicode corrigendum #9 makes it clear
+ that such characters should not cause a string to be rejected, and so
this code is no longer in use and is never returned.
@@ -3767,78 +3782,78 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
- Captured substrings can be accessed directly by using the offsets
- returned by pcre_exec() in ovector. For convenience, the functions
+ Captured substrings can be accessed directly by using the offsets
+ returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
- string_list() are provided for extracting captured substrings as new,
- separate, zero-terminated strings. These functions identify substrings
- by number. The next section describes functions for extracting named
+ string_list() are provided for extracting captured substrings as new,
+ separate, zero-terminated strings. These functions identify substrings
+ by number. The next section describes functions for extracting named
substrings.
- A substring that contains a binary zero is correctly extracted and has
- a further zero added on the end, but the result is not, of course, a C
- string. However, you can process such a string by referring to the
- length that is returned by pcre_copy_substring() and pcre_get_sub-
+ A substring that contains a binary zero is correctly extracted and has
+ a further zero added on the end, but the result is not, of course, a C
+ string. However, you can process such a string by referring to the
+ length that is returned by pcre_copy_substring() and pcre_get_sub-
string(). Unfortunately, the interface to pcre_get_substring_list() is
- not adequate for handling strings containing binary zeros, because the
+ not adequate for handling strings containing binary zeros, because the
end of the final string is not independently indicated.
- The first three arguments are the same for all three of these func-
- tions: subject is the subject string that has just been successfully
+ The first three arguments are the same for all three of these func-
+ tions: subject is the subject string that has just been successfully
matched, ovector is a pointer to the vector of integer offsets that was
passed to pcre_exec(), and stringcount is the number of substrings that
- were captured by the match, including the substring that matched the
+ were captured by the match, including the substring that matched the
entire regular expression. This is the value returned by pcre_exec() if
- it is greater than zero. If pcre_exec() returned zero, indicating that
- it ran out of space in ovector, the value passed as stringcount should
+ it is greater than zero. If pcre_exec() returned zero, indicating that
+ it ran out of space in ovector, the value passed as stringcount should
be the number of elements in the vector divided by three.
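The rule for choosing stringcount can be sketched as a small helper. The name stringcount_from_rc is hypothetical; rc is assumed to be the value returned by pcre_exec() and ovecsize the element count that was passed to it.

```c
/* Hypothetical helper: derive the stringcount argument for the substring
   extraction functions from the pcre_exec() return value. */
int stringcount_from_rc(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;    /* ovector overflowed: all pairs are in use */
    return rc;                  /* negative: an error, nothing to extract */
}
```

A negative result should be treated as an error by the caller rather than passed on as a count.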
- The functions pcre_copy_substring() and pcre_get_substring() extract a
- single substring, whose number is given as stringnumber. A value of
- zero extracts the substring that matched the entire pattern, whereas
- higher values extract the captured substrings. For pcre_copy_sub-
- string(), the string is placed in buffer, whose length is given by
- buffersize, while for pcre_get_substring() a new block of memory is
- obtained via pcre_malloc, and its address is returned via stringptr.
- The yield of the function is the length of the string, not including
+ The functions pcre_copy_substring() and pcre_get_substring() extract a
+ single substring, whose number is given as stringnumber. A value of
+ zero extracts the substring that matched the entire pattern, whereas
+ higher values extract the captured substrings. For pcre_copy_sub-
+ string(), the string is placed in buffer, whose length is given by
+ buffersize, while for pcre_get_substring() a new block of memory is
+ obtained via pcre_malloc, and its address is returned via stringptr.
+ The yield of the function is the length of the string, not including
the terminating zero, or one of these error codes:
PCRE_ERROR_NOMEMORY (-6)
- The buffer was too small for pcre_copy_substring(), or the attempt to
+ The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().
PCRE_ERROR_NOSUBSTRING (-7)
There is no substring whose number is stringnumber.
- The pcre_get_substring_list() function extracts all available sub-
- strings and builds a list of pointers to them. All this is done in a
+ The pcre_get_substring_list() function extracts all available sub-
+ strings and builds a list of pointers to them. All this is done in a
single block of memory that is obtained via pcre_malloc. The address of
- the memory block is returned via listptr, which is also the start of
- the list of string pointers. The end of the list is marked by a NULL
- pointer. The yield of the function is zero if all went well, or the
+ the memory block is returned via listptr, which is also the start of
+ the list of string pointers. The end of the list is marked by a NULL
+ pointer. The yield of the function is zero if all went well, or the
error code
PCRE_ERROR_NOMEMORY (-6)
if the attempt to get the memory block failed.
- When any of these functions encounter a substring that is unset, which
- can happen when capturing subpattern number n+1 matches some part of
- the subject, but subpattern n has not been used at all, they return an
+ When any of these functions encounters a substring that is unset, which
+ can happen when capturing subpattern number n+1 matches some part of
+ the subject, but subpattern n has not been used at all, they return an
empty string. This can be distinguished from a genuine zero-length sub-
- string by inspecting the appropriate offset in ovector, which is nega-
+ string by inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings.
- The two convenience functions pcre_free_substring() and pcre_free_sub-
- string_list() can be used to free the memory returned by a previous
+ The two convenience functions pcre_free_substring() and pcre_free_sub-
+ string_list() can be used to free the memory returned by a previous
call of pcre_get_substring() or pcre_get_substring_list(), respec-
- tively. They do nothing more than call the function pointed to by
- pcre_free, which of course could be called directly from a C program.
- However, PCRE is used in some situations where it is linked via a spe-
- cial interface to another programming language that cannot use
- pcre_free directly; it is for these cases that the functions are pro-
+ tively. They do nothing more than call the function pointed to by
+ pcre_free, which of course could be called directly from a C program.
+ However, PCRE is used in some situations where it is linked via a spe-
+ cial interface to another programming language that cannot use
+ pcre_free directly; it is for these cases that the functions are pro-
vided.
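The test for an unset substring described above can be sketched in C as
follows. This is only an illustration: the enum and the function name
classify_capture() are hypothetical and not part of PCRE, and the caller is
assumed to pass a capture number whose offset pair lies inside the ovector
that was given to pcre_exec().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of one capture slot; not a PCRE type. */
enum capture_state { CAPTURE_UNSET, CAPTURE_EMPTY, CAPTURE_NONEMPTY };

/* ovector holds (start, end) offset pairs as filled in by pcre_exec();
   n is the capture number, with 0 meaning the whole match. */
enum capture_state classify_capture(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);

    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];

    if (start < 0)
        return CAPTURE_UNSET;      /* negative offset: group took no part */
    return (end == start) ? CAPTURE_EMPTY : CAPTURE_NONEMPTY;
}
```

A caller would typically run this check before pcre_copy_substring() or
pcre_get_substring(), since both return an empty string in the unset case
and in the genuinely empty case alike.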
@@ -3857,7 +3872,7 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
int stringcount, const char *stringname,
const char **stringptr);
- To extract a substring by name, you first have to find the associated
+ To extract a substring by name, you first have to find the associated
number. For example, for this pattern
(a+)b(?<xxx>\d+)...
@@ -3866,35 +3881,35 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
be unique (PCRE_DUPNAMES was not set), you can find the number from the
name by calling pcre_get_stringnumber(). The first argument is the com-
piled pattern, and the second is the name. The yield of the function is
- the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
+ the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
subpattern of that name.
Given the number, you can extract the substring directly, or use one of
the functions described in the previous section. For convenience, there
are also two functions that do the whole job.
- Most of the arguments of pcre_copy_named_substring() and
- pcre_get_named_substring() are the same as those for the similarly
- named functions that extract by number. As these are described in the
- previous section, they are not re-described here. There are just two
+ Most of the arguments of pcre_copy_named_substring() and
+ pcre_get_named_substring() are the same as those for the similarly
+ named functions that extract by number. As these are described in the
+ previous section, they are not re-described here. There are just two
differences:
- First, instead of a substring number, a substring name is given. Sec-
+ First, instead of a substring number, a substring name is given. Sec-
ond, there is an extra argument, given at the start, which is a pointer
- to the compiled pattern. This is needed in order to gain access to the
+ to the compiled pattern. This is needed in order to gain access to the
name-to-number translation table.
- These functions call pcre_get_stringnumber(), and if it succeeds, they
- then call pcre_copy_substring() or pcre_get_substring(), as appropri-
- ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
+ These functions call pcre_get_stringnumber(), and if it succeeds, they
+ then call pcre_copy_substring() or pcre_get_substring(), as appropri-
+ ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
behaviour may not be what you want (see the next section).
Warning: If the pattern uses the (?| feature to set up multiple subpat-
- terns with the same number, as described in the section on duplicate
- subpattern numbers in the pcrepattern page, you cannot use names to
- distinguish the different subpatterns, because names are not included
- in the compiled code. The matching process uses only numbers. For this
- reason, the use of different names for subpatterns of the same number
+ terns with the same number, as described in the section on duplicate
+ subpattern numbers in the pcrepattern page, you cannot use names to
+ distinguish the different subpatterns, because names are not included
+ in the compiled code. The matching process uses only numbers. For this
+ reason, the use of different names for subpatterns of the same number
causes an error at compile time.
@@ -3903,76 +3918,76 @@ DUPLICATE SUBPATTERN NAMES
int pcre_get_stringtable_entries(const pcre *code,
const char *name, char **first, char **last);
- When a pattern is compiled with the PCRE_DUPNAMES option, names for
- subpatterns are not required to be unique. (Duplicate names are always
- allowed for subpatterns with the same number, created by using the (?|
- feature. Indeed, if such subpatterns are named, they are required to
+ When a pattern is compiled with the PCRE_DUPNAMES option, names for
+ subpatterns are not required to be unique. (Duplicate names are always
+ allowed for subpatterns with the same number, created by using the (?|
+ feature. Indeed, if such subpatterns are named, they are required to
use the same names.)
Normally, patterns with duplicate names are such that in any one match,
- only one of the named subpatterns participates. An example is shown in
+ only one of the named subpatterns participates. An example is shown in
the pcrepattern documentation.
- When duplicates are present, pcre_copy_named_substring() and
- pcre_get_named_substring() return the first substring corresponding to
- the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
- (-7) is returned; no data is returned. The pcre_get_stringnumber()
- function returns one of the numbers that are associated with the name,
+ When duplicates are present, pcre_copy_named_substring() and
+ pcre_get_named_substring() return the first substring corresponding to
+ the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
+ (-7) is returned; no data is returned. The pcre_get_stringnumber()
+ function returns one of the numbers that are associated with the name,
but it is not defined which it is.
- If you want to get full details of all captured substrings for a given
- name, you must use the pcre_get_stringtable_entries() function. The
+ If you want to get full details of all captured substrings for a given
+ name, you must use the pcre_get_stringtable_entries() function. The
first argument is the compiled pattern, and the second is the name. The
- third and fourth are pointers to variables which are updated by the
+ third and fourth are pointers to variables which are updated by the
function. After it has run, they point to the first and last entries in
- the name-to-number table for the given name. The function itself
- returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
- there are none. The format of the table is described in the sec-
- tion entitled Information about a pattern above. Given all the rele-
- vant entries for the name, you can extract each of their numbers, and
+ the name-to-number table for the given name. The function itself
+ returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
+ there are none. The format of the table is described in the sec-
+ tion entitled Information about a pattern above. Given all the rele-
+ vant entries for the name, you can extract each of their numbers, and
hence the captured data, if any.
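Walking the entries returned by pcre_get_stringtable_entries() might look
like the sketch below. It assumes the 8-bit library's table layout, which is
described in the section on information about a pattern and not in this
hunk: each entry starts with the group number in two bytes, most significant
byte first, followed by the zero-terminated name. The helper name
collect_group_numbers() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the group numbers from name-table entries first..last into numbers[].
   entry_size is the value returned by pcre_get_stringtable_entries().
   Returns how many numbers were stored (at most max_numbers). */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max_numbers)
{
    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2);

    int count = 0;
    for (const char *entry = first; count < max_numbers; entry += entry_size) {
        /* Assumed layout: two-byte group number, high byte first. */
        numbers[count++] = ((unsigned char)entry[0] << 8)
                         | (unsigned char)entry[1];
        if (entry == last)
            break;                 /* last entry for this name processed */
    }
    return count;
}
```

Each number obtained this way can then be checked against ovector to see
which of the duplicate-named groups, if any, actually captured something.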
FINDING ALL POSSIBLE MATCHES
- The traditional matching function uses a similar algorithm to Perl,
+ The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in
- the subject. If you want to find all possible matches, or the longest
- possible match, consider using the alternative matching function (see
- below) instead. If you cannot use the alternative function, but still
- need to find all possible matches, you can kludge it up by making use
+ the subject. If you want to find all possible matches, or the longest
+ possible match, consider using the alternative matching function (see
+ below) instead. If you cannot use the alternative function, but still
+ need to find all possible matches, you can kludge it up by making use
of the callout facility, which is described in the pcrecallout documen-
tation.
What you have to do is to insert a callout right at the end of the pat-
- tern. When your callout function is called, extract and save the cur-
- rent matched substring. Then return 1, which forces pcre_exec() to
- backtrack and try other alternatives. Ultimately, when it runs out of
+ tern. When your callout function is called, extract and save the cur-
+ rent matched substring. Then return 1, which forces pcre_exec() to
+ backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
OBTAINING AN ESTIMATE OF STACK USAGE
- Matching certain patterns using pcre_exec() can use a lot of process
- stack, which in certain environments can be rather limited in size.
- Some users find it helpful to have an estimate of the amount of stack
- that is used by pcre_exec(), to help them set recursion limits, as
- described in the pcrestack documentation. The estimate that is output
+ Matching certain patterns using pcre_exec() can use a lot of process
+ stack, which in certain environments can be rather limited in size.
+ Some users find it helpful to have an estimate of the amount of stack
+ that is used by pcre_exec(), to help them set recursion limits, as
+ described in the pcrestack documentation. The estimate that is output
by pcretest when called with the -m and -C options is obtained by call-
- ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
+ ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
first five arguments.
- Normally, if its first argument is NULL, pcre_exec() immediately
- returns the negative error code PCRE_ERROR_NULL, but with this special
- combination of arguments, it returns instead a negative number whose
- absolute value is the approximate stack frame size in bytes. (A nega-
- tive number is used so that it is clear that no match has happened.)
- The value is approximate because in some cases, recursive calls to
+ Normally, if its first argument is NULL, pcre_exec() immediately
+ returns the negative error code PCRE_ERROR_NULL, but with this special
+ combination of arguments, it returns instead a negative number whose
+ absolute value is the approximate stack frame size in bytes. (A nega-
+ tive number is used so that it is clear that no match has happened.)
+ The value is approximate because in some cases, recursive calls to
pcre_exec() occur when there are one or two additional variables on the
stack.
- If PCRE has been compiled to use the heap instead of the stack for
- recursion, the value returned is the size of each block that is
+ If PCRE has been compiled to use the heap instead of the stack for
+ recursion, the value returned is the size of each block that is
obtained from the heap.
@@ -3983,26 +3998,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
int options, int *ovector, int ovecsize,
int *workspace, int wscount);
- The function pcre_dfa_exec() is called to match a subject string
- against a compiled pattern, using a matching algorithm that scans the
- subject string just once, and does not backtrack. This has different
- characteristics to the normal algorithm, and is not compatible with
- Perl. Some of the features of PCRE patterns are not supported. Never-
- theless, there are times when this kind of matching can be useful. For
- a discussion of the two matching algorithms, and a list of features
- that pcre_dfa_exec() does not support, see the pcrematching documenta-
+ The function pcre_dfa_exec() is called to match a subject string
+ against a compiled pattern, using a matching algorithm that scans the
+ subject string just once, and does not backtrack. This has different
+ characteristics to the normal algorithm, and is not compatible with
+ Perl. Some of the features of PCRE patterns are not supported. Never-
+ theless, there are times when this kind of matching can be useful. For
+ a discussion of the two matching algorithms, and a list of features
+ that pcre_dfa_exec() does not support, see the pcrematching documenta-
tion.
- The arguments for the pcre_dfa_exec() function are the same as for
+ The arguments for the pcre_dfa_exec() function are the same as for
pcre_exec(), plus two extras. The ovector argument is used in a differ-
- ent way, and this is described below. The other common arguments are
- used in the same way as for pcre_exec(), so their description is not
+ ent way, and this is described below. The other common arguments are
+ used in the same way as for pcre_exec(), so their description is not
repeated here.
- The two additional arguments provide workspace for the function. The
- workspace vector should contain at least 20 elements. It is used for
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
- workspace will be needed for patterns and subjects where there are a
+ workspace will be needed for patterns and subjects where there are a
lot of potential matches.
Here is an example of a simple call to pcre_dfa_exec():
@@ -4024,55 +4039,55 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre_dfa_exec()
- The unused bits of the options argument for pcre_dfa_exec() must be
- zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
+ The unused bits of the options argument for pcre_dfa_exec() must be
+ zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
- PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
- PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
- TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
- four of these are exactly the same as for pcre_exec(), so their
+ PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
+ PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
+ TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
+ four of these are exactly the same as for pcre_exec(), so their
description is not repeated here.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
- These have the same general effect as they do for pcre_exec(), but the
- details are slightly different. When PCRE_PARTIAL_HARD is set for
- pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
- ject is reached and there is still at least one matching possibility
+ These have the same general effect as they do for pcre_exec(), but the
+ details are slightly different. When PCRE_PARTIAL_HARD is set for
+ pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
+ ject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete
matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
- of the subject is reached, there have been no complete matches, but
- there is still at least one matching possibility. The portion of the
- string that was inspected when the longest partial match was found is
- set as the first matching string in both cases. There is a more
- detailed discussion of partial and multi-segment matching, with exam-
+ of the subject is reached, there have been no complete matches, but
+ there is still at least one matching possibility. The portion of the
+ string that was inspected when the longest partial match was found is
+ set as the first matching string in both cases. There is a more
+ detailed discussion of partial and multi-segment matching, with exam-
ples, in the pcrepartial documentation.
PCRE_DFA_SHORTEST
- Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
+ Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna-
- tive algorithm works, this is necessarily the shortest possible match
+ tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string.
PCRE_DFA_RESTART
When pcre_dfa_exec() returns a partial match, it is possible to call it
- again, with additional subject characters, and have it continue with
- the same match. The PCRE_DFA_RESTART option requests this action; when
- it is set, the workspace and wscount options must reference the same
- vector as before because data about the match so far is left in them
+ again, with additional subject characters, and have it continue with
+ the same match. The PCRE_DFA_RESTART option requests this action; when
+ it is set, the workspace and wscount options must reference the same
+ vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcrepartial documentation.
Successful returns from pcre_dfa_exec()
- When pcre_dfa_exec() succeeds, it may have matched more than one sub-
+ When pcre_dfa_exec() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run
- of the function start at the same point in the subject. The shorter
- matches are all initial substrings of the longer matches. For example,
+ of the function start at the same point in the subject. The shorter
+ matches are all initial substrings of the longer matches. For example,
if the pattern
<.*>
@@ -4087,70 +4102,70 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else>
<something> <something else> <something further>
- On success, the yield of the function is a number greater than zero,
- which is the number of matched substrings. The substrings themselves
- are returned in ovector. Each string uses two elements; the first is
- the offset to the start, and the second is the offset to the end. In
- fact, all the strings have the same start offset. (Space could have
- been saved by giving this only once, but it was decided to retain some
- compatibility with the way pcre_exec() returns data, even though the
+ On success, the yield of the function is a number greater than zero,
+ which is the number of matched substrings. The substrings themselves
+ are returned in ovector. Each string uses two elements; the first is
+ the offset to the start, and the second is the offset to the end. In
+ fact, all the strings have the same start offset. (Space could have
+ been saved by giving this only once, but it was decided to retain some
+ compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)
The strings are returned in reverse order of length; that is, the long-
- est matching string is given first. If there were too many matches to
- fit into ovector, the yield of the function is zero, and the vector is
- filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
+ est matching string is given first. If there were too many matches to
+ fit into ovector, the yield of the function is zero, and the vector is
+ filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
can use the entire ovector for returning matched strings.
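The rules above for reading the results of pcre_dfa_exec() can be sketched
as two small helpers. The names dfa_match_count() and dfa_match_length()
are hypothetical; rc stands for the function's return value and ovecsize
for the size of the vector that was passed in.

```c
#include <assert.h>
#include <stddef.h>

/* Number of (start, end) pairs in ovector that hold valid matches. */
int dfa_match_count(int rc, int ovecsize)
{
    assert(ovecsize >= 0);

    if (rc > 0)
        return rc;             /* rc is the number of matched substrings */
    if (rc == 0)
        return ovecsize / 2;   /* too many matches: vector holds the longest */
    return 0;                  /* negative: an error or no match */
}

/* Length of the i-th match; index 0 is the longest one. */
int dfa_match_length(const int *ovector, int i)
{
    assert(ovector != NULL && i >= 0);
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

Because the matches all share one start offset and are ordered longest
first, the lengths produced by dfa_match_length() decrease as the index
grows.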
Error returns from pcre_dfa_exec()
- The pcre_dfa_exec() function returns a negative number when it fails.
- Many of the errors are the same as for pcre_exec(), and these are
- described above. There are in addition the following errors that are
+ The pcre_dfa_exec() function returns a negative number when it fails.
+ Many of the errors are the same as for pcre_exec(), and these are
+ described above. There are in addition the following errors that are
specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16)
- This return is given if pcre_dfa_exec() encounters an item in the pat-
- tern that it does not support, for instance, the use of \C or a back
+ This return is given if pcre_dfa_exec() encounters an item in the pat-
+ tern that it does not support, for instance, the use of \C or a back
reference.
PCRE_ERROR_DFA_UCOND (-17)
- This return is given if pcre_dfa_exec() encounters a condition item
- that uses a back reference for the condition, or a test for recursion
+ This return is given if pcre_dfa_exec() encounters a condition item
+ that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.
PCRE_ERROR_DFA_UMLIMIT (-18)
- This return is given if pcre_dfa_exec() is called with an extra block
- that contains a setting of the match_limit or match_limit_recursion
- fields. This is not supported (these fields are meaningless for DFA
+ This return is given if pcre_dfa_exec() is called with an extra block
+ that contains a setting of the match_limit or match_limit_recursion
+ fields. This is not supported (these fields are meaningless for DFA
matching).
PCRE_ERROR_DFA_WSSIZE (-19)
- This return is given if pcre_dfa_exec() runs out of space in the
+ This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.
PCRE_ERROR_DFA_RECURSE (-20)
- When a recursive subpattern is processed, the matching function calls
- itself recursively, using private vectors for ovector and workspace.
- This error is given if the output vector is not large enough. This
+ When a recursive subpattern is processed, the matching function calls
+ itself recursively, using private vectors for ovector and workspace.
+ This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.
PCRE_ERROR_DFA_BADRESTART (-30)
- When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
- plausibility checks are made on the contents of the workspace, which
- should contain data about the previous partial match. If any of these
+ When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
+ plausibility checks are made on the contents of the workspace, which
+ should contain data about the previous partial match. If any of these
checks fail, this error is given.
SEE ALSO
- pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3),
+ pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3),
pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
sample(3), pcrestack(3).
@@ -4164,7 +4179,7 @@ AUTHOR
REVISION
- Last updated: 26 April 2013
+ Last updated: 12 May 2013
Copyright (c) 1997-2013 University of Cambridge.
------------------------------------------------------------------------------