summaryrefslogtreecommitdiff
path: root/doc/pcreapi.3
diff options
context:
space:
mode:
Diffstat (limited to 'doc/pcreapi.3')
-rw-r--r--doc/pcreapi.386
1 files changed, 47 insertions, 39 deletions
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 7145a4e..ac111f1 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -1,4 +1,4 @@
-.TH PCREAPI 3 "10 May 2013" "PCRE 8.33"
+.TH PCREAPI 3 "12 May 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.sp
@@ -161,10 +161,10 @@ by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the
16-bit and 32-bit option names define the same bit values.
.P
References to bytes and UTF-8 in this document should be read as references to
-16-bit data quantities and UTF-16 when using the 16-bit library, or 32-bit data
-quantities and UTF-32 when using the 32-bit library, unless specified
-otherwise. More details of the specific differences for the 16-bit and 32-bit
-libraries are given in the
+16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data
+units and UTF-32 when using the 32-bit library, unless specified otherwise.
+More details of the specific differences for the 16-bit and 32-bit libraries
+are given in the
.\" HREF
\fBpcre16\fP
.\"
@@ -562,15 +562,15 @@ Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns
NULL, and sets the variable pointed to by \fIerrptr\fP to point to a textual
error message. This is a static string that is part of the library. You must
not try to free it. Normally, the offset from the start of the pattern to the
-byte that was being processed when the error was discovered is placed in the
-variable pointed to by \fIerroffset\fP, which must not be NULL (if it is, an
-immediate error is given). However, for an invalid UTF-8 string, the offset is
-that of the first byte of the failing character.
+data unit that was being processed when the error was discovered is placed in
+the variable pointed to by \fIerroffset\fP, which must not be NULL (if it is,
+an immediate error is given). However, for an invalid UTF-8 or UTF-16 string,
+the offset is that of the first data unit of the failing character.
.P
Some errors are not detected until the whole pattern has been scanned; in these
cases, the offset passed back is the length of the pattern. Note that the
-offset is in bytes, not characters, even in UTF-8 mode. It may sometimes point
-into the middle of a UTF-8 character.
+offset is in data units, not characters, even in a UTF mode. It may sometimes
+point into the middle of a UTF-8 or UTF-16 character.
.P
If \fBpcre_compile2()\fP is used instead of \fBpcre_compile()\fP, and the
\fIerrorcodeptr\fP argument is not NULL, a non-zero error code number is
@@ -1323,7 +1323,7 @@ call to \fBpcre_fullinfo()\fP returns the error PCRE_ERROR_UNSET.
.sp
PCRE_INFO_MAXLOOKBEHIND
.sp
-Return the number of characters (NB not bytes) in the longest lookbehind
+Return the number of characters (NB not data units) in the longest lookbehind
assertion in the pattern. This information is useful when doing multi-segment
matching using the partial matching facilities. Note that the simple assertions
\eb and \eB require a one-character lookbehind. \eA also registers a
@@ -1337,11 +1337,11 @@ segment.
.sp
If the pattern was studied and a minimum length for matching subject strings
was computed, its value is returned. Otherwise the returned value is -1. The
-value is a number of characters, which in UTF-8 mode may be different from the
-number of bytes. The fourth argument should point to an \fBint\fP variable. A
-non-negative value is a lower bound to the length of any matching string. There
-may not be any strings of that length that do actually match, but every string
-that does match is at least that long.
+value is a number of characters, which in UTF mode may be different from the
+number of data units. The fourth argument should point to an \fBint\fP
+variable. A non-negative value is a lower bound to the length of any matching
+string. There may not be any strings of that length that do actually match, but
+every string that does match is at least that long.
.sp
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
@@ -1364,10 +1364,10 @@ length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first
entry of the table. This is a pointer to \fBchar\fP in the 8-bit library, where
the first two bytes of each entry are the number of the capturing parenthesis,
most significant byte first. In the 16-bit library, the pointer points to
-16-bit data units, the first of which contains the parenthesis number.
-In the 32-bit library, the pointer points to 32-bit data units, the first of
-which contains the parenthesis number. The rest
-of the entry is the corresponding name, zero terminated.
+16-bit data units, the first of which contains the parenthesis number. In the
+32-bit library, the pointer points to 32-bit data units, the first of which
+contains the parenthesis number. The rest of the entry is the corresponding
+name, zero terminated.
.P
The names are in alphabetical order. Duplicate names may appear if (?| is used
to create multiple groups with the same number, as described in the
@@ -1449,7 +1449,7 @@ set, the call to \fBpcre_fullinfo()\fP returns the error PCRE_ERROR_UNSET.
.sp
PCRE_INFO_SIZE
.sp
-Return the size of the compiled pattern in bytes (for both libraries). The
+Return the size of the compiled pattern in bytes (for all three libraries). The
fourth argument should point to a \fBsize_t\fP variable. This value does not
include the size of the \fBpcre\fP structure that is returned by
\fBpcre_compile()\fP. The value that is passed as the argument to
@@ -1460,11 +1460,12 @@ does not alter the value returned by this option.
.sp
PCRE_INFO_STUDYSIZE
.sp
-Return the size in bytes of the data block pointed to by the \fIstudy_data\fP
-field in a \fBpcre_extra\fP block. If \fBpcre_extra\fP is NULL, or there is no
-study data, zero is returned. The fourth argument should point to a
-\fBsize_t\fP variable. The \fIstudy_data\fP field is set by \fBpcre_study()\fP
-to record information that will speed up matching (see the section entitled
+Return the size in bytes (for all three libraries) of the data block pointed to
+by the \fIstudy_data\fP field in a \fBpcre_extra\fP block. If \fBpcre_extra\fP
+is NULL, or there is no study data, zero is returned. The fourth argument
+should point to a \fBsize_t\fP variable. The \fIstudy_data\fP field is set by
+\fBpcre_study()\fP to record information that will speed up matching (see the
+section entitled
.\" HTML <a href="#studyingapattern">
.\" </a>
"Studying a pattern"
@@ -1993,13 +1994,18 @@ documentation.
.rs
.sp
The subject string is passed to \fBpcre_exec()\fP as a pointer in
-\fIsubject\fP, a length in bytes in \fIlength\fP, and a starting byte offset
-in \fIstartoffset\fP. If this is negative or greater than the length of the
-subject, \fBpcre_exec()\fP returns PCRE_ERROR_BADOFFSET. When the starting
-offset is zero, the search for a match starts at the beginning of the subject,
-and this is by far the most common case. In UTF-8 mode, the byte offset must
-point to the start of a UTF-8 character (or the end of the subject). Unlike the
-pattern string, the subject may contain binary zero bytes.
+\fIsubject\fP, a length in \fIlength\fP, and a starting offset in
+\fIstartoffset\fP. The units for \fIlength\fP and \fIstartoffset\fP are bytes
+for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit
+data items for the 32-bit library.
+.P
+If \fIstartoffset\fP is negative or greater than the length of the subject,
+\fBpcre_exec()\fP returns PCRE_ERROR_BADOFFSET. When the starting offset is
+zero, the search for a match starts at the beginning of the subject, and this
+is by far the most common case. In UTF-8 or UTF-16 mode, the offset must point
+to the start of a character, or the end of the subject (in UTF-32 mode, one
+data unit equals one character, so all offsets are valid). Unlike the pattern
+string, the subject may contain binary zeroes.
.P
A non-zero starting offset is useful when searching for another match in the
same subject by calling \fBpcre_exec()\fP again after a previous success.
@@ -2063,10 +2069,12 @@ rounded down.
When a match is successful, information about captured substrings is returned
in pairs of integers, starting at the beginning of \fIovector\fP, and
continuing up to two-thirds of its length at the most. The first element of
-each pair is set to the byte offset of the first character in a substring, and
-the second is set to the byte offset of the first character after the end of a
-substring. \fBNote\fP: these values are always byte offsets, even in UTF-8
-mode. They are not character counts.
+each pair is set to the offset of the first character in a substring, and the
+second is set to the offset of the first character after the end of a
+substring. These values are always data unit offsets, even in UTF mode. They
+are byte offsets in the 8-bit library, 16-bit data item offsets in the 16-bit
+library, and 32-bit data item offsets in the 32-bit library. \fBNote\fP: they
+are not character counts.
.P
The first pair of integers, \fIovector[0]\fP and \fIovector[1]\fP, identify the
portion of the subject string matched by the entire pattern. The next pair is
@@ -2878,6 +2886,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 10 May 2013
+Last updated: 12 May 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi