Diffstat (limited to 'srclib/pcre/doc/pcretest.txt')
-rw-r--r-- srclib/pcre/doc/pcretest.txt 531
1 files changed, 317 insertions, 214 deletions
diff --git a/srclib/pcre/doc/pcretest.txt b/srclib/pcre/doc/pcretest.txt
index 831fdac987..0e13b6c6c5 100644
--- a/srclib/pcre/doc/pcretest.txt
+++ b/srclib/pcre/doc/pcretest.txt
@@ -1,216 +1,319 @@
-The pcretest program
---------------------
+NAME
+ pcretest - a program for testing Perl-compatible regular
+ expressions.
-This program is intended for testing PCRE, but it can also be used for
-experimenting with regular expressions.
-If it is given two filename arguments, it reads from the first and writes to
-the second. If it is given only one filename argument, it reads from that file
-and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and
-prompts for each line of input, using "re>" to prompt for regular expressions,
-and "data>" to prompt for data lines.
-
-The program handles any number of sets of input on a single input file. Each
-set starts with a regular expression, and continues with any number of data
-lines to be matched against the pattern. An empty line signals the end of the
-data lines, at which point a new regular expression is read. The regular
-expressions are given enclosed in any non-alphameric delimiters other than
-backslash, for example
-
- /(a|bc)x+yz/
-
-White space before the initial delimiter is ignored. A regular expression may
-be continued over several input lines, in which case the newline characters are
-included within it. See the test input files in the testdata directory for many
-examples. It is possible to include the delimiter within the pattern by
-escaping it, for example
-
- /abc\/def/
-
-If you do so, the escape and the delimiter form part of the pattern, but since
-delimiters are always non-alphameric, this does not affect its interpretation.
-If the terminating delimiter is immediately followed by a backslash, for
-example,
-
- /abc/\
-
-then a backslash is added to the end of the pattern. This is done to provide a
-way of testing the error condition that arises if a pattern finishes with a
-backslash, because
-
- /abc\/
-
-is interpreted as the first line of a pattern that starts with "abc/", causing
-pcretest to read the next line as a continuation of the regular expression.
-
-The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
-PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For
-example:
-
- /caseless/i
-
-These modifier letters have the same effect as they do in Perl. There are
-others which set PCRE options that do not correspond to anything in Perl: /A,
-/E, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.
-
-Searching for all possible matches within each subject string can be requested
-by the /g or /G modifier. After finding a match, PCRE is called again to search
-the remainder of the subject string. The difference between /g and /G is that
-the former uses the startoffset argument to pcre_exec() to start searching at
-a new point within the entire string (which is in effect what Perl does),
-whereas the latter passes over a shortened substring. This makes a difference
-to the matching process if the pattern begins with a lookbehind assertion
-(including \b or \B).
-
-If any call to pcre_exec() in a /g or /G sequence matches an empty string, the
-next call is done with the PCRE_NOTEMPTY flag set so that it cannot match an
-empty string again at the same point. If however, this second match fails, the
-start offset is advanced by one, and the match is retried. This imitates the
-way Perl handles such cases when using the /g modifier or the split() function.
-
-There are a number of other modifiers for controlling the way pcretest
-operates.
-
-The /+ modifier requests that as well as outputting the substring that matched
-the entire pattern, pcretest should in addition output the remainder of the
-subject string. This is useful for tests where the subject contains multiple
-copies of the same substring.
-
-The /L modifier must be followed directly by the name of a locale, for example,
-
- /pattern/Lfr
-
-For this reason, it must be the last modifier letter. The given locale is set,
-pcre_maketables() is called to build a set of character tables for the locale,
-and this is then passed to pcre_compile() when compiling the regular
-expression. Without an /L modifier, NULL is passed as the tables pointer; that
-is, /L applies only to the expression on which it appears.
-
-The /I modifier requests that pcretest output information about the compiled
-expression (whether it is anchored, has a fixed first character, and so on). It
-does this by calling pcre_fullinfo() after compiling an expression, and
-outputting the information it gets back. If the pattern is studied, the results
-of that are also output.
-
-The /D modifier is a PCRE debugging feature, which also assumes /I. It causes
-the internal form of compiled regular expressions to be output after
-compilation.
-
-The /S modifier causes pcre_study() to be called after the expression has been
-compiled, and the results used when the expression is matched.
-
-The /M modifier causes the size of memory block used to hold the compiled
-pattern to be output.
-
-Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API
-rather than its native API. When this is done, all other modifiers except /i,
-/m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is
-set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,
-and PCRE_DOTALL unless REG_NEWLINE is set.
-
-Before each data line is passed to pcre_exec(), leading and trailing whitespace
-is removed, and it is then scanned for \ escapes. The following are recognized:
-
- \a alarm (= BEL)
- \b backspace
- \e escape
- \f formfeed
- \n newline
- \r carriage return
- \t tab
- \v vertical tab
- \nnn octal character (up to 3 octal digits)
- \xhh hexadecimal character (up to 2 hex digits)
-
- \A pass the PCRE_ANCHORED option to pcre_exec()
- \B pass the PCRE_NOTBOL option to pcre_exec()
- \Cdd call pcre_copy_substring() for substring dd after a successful match
- (any decimal number less than 32)
- \Gdd call pcre_get_substring() for substring dd after a successful match
- (any decimal number less than 32)
- \L call pcre_get_substringlist() after a successful match
- \N pass the PCRE_NOTEMPTY option to pcre_exec()
- \Odd set the size of the output vector passed to pcre_exec() to dd
- (any number of decimal digits)
- \Z pass the PCRE_NOTEOL option to pcre_exec()
-
-A backslash followed by anything else just escapes the anything else. If the
-very last character is a backslash, it is ignored. This gives a way of passing
-an empty line as data, since a real empty line terminates the data input.
-
-If /P was present on the regex, causing the POSIX wrapper API to be used, only
-\B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
-regexec() respectively.
-
-When a match succeeds, pcretest outputs the list of captured substrings that
-pcre_exec() returns, starting with number 0 for the string that matched the
-whole pattern. Here is an example of an interactive pcretest run.
-
- $ pcretest
- PCRE version 2.06 08-Jun-1999
-
- re> /^abc(\d+)/
- data> abc123
- 0: abc123
- 1: 123
- data> xyz
- No match
-
-If the strings contain any non-printing characters, they are output as \0x
-escapes. If the pattern has the /+ modifier, then the output for substring 0 is
-followed by the the rest of the subject string, identified by "0+" like this:
-
- re> /cat/+
- data> cataract
- 0: cat
- 0+ aract
-
-If the pattern has the /g or /G modifier, the results of successive matching
-attempts are output in sequence, like this:
-
- re> /\Bi(\w\w)/g
- data> Mississippi
- 0: iss
- 1: ss
- 0: iss
- 1: ss
- 0: ipp
- 1: pp
-
-"No match" is output only if the first match attempt fails.
-
-If any of \C, \G, or \L are present in a data line that is successfully
-matched, the substrings extracted by the convenience functions are output with
-C, G, or L after the string number instead of a colon. This is in addition to
-the normal full list. The string length (that is, the return from the
-extraction function) is given in parentheses after each string for \C and \G.
-
-Note that while patterns can be continued over several lines (a plain ">"
-prompt is used for continuations), data lines may not. However newlines can be
-included in data by means of the \n escape.
-
-If the -p option is given to pcretest, it is equivalent to adding /P to each
-regular expression: the POSIX wrapper API is used to call PCRE. None of the
-following flags has any effect in this case.
-
-If the option -d is given to pcretest, it is equivalent to adding /D to each
-regular expression: the internal form is output after compilation.
-
-If the option -i is given to pcretest, it is equivalent to adding /I to each
-regular expression: information about the compiled pattern is given after
-compilation.
-
-If the option -m is given to pcretest, it outputs the size of each compiled
-pattern after it has been compiled. It is equivalent to adding /M to each
-regular expression. For compatibility with earlier versions of pcretest, -s is
-a synonym for -m.
-
-If the -t option is given, each compile, study, and match is run 20000 times
-while being timed, and the resulting time per compile or match is output in
-milliseconds. Do not set -t with -s, because you will then get the size output
-20000 times and the timing will be distorted. If you want to change the number
-of repetitions used for timing, edit the definition of LOOPREPEAT at the top of
-pcretest.c
-
-Philip Hazel <ph10@cam.ac.uk>
-January 2000
+
+SYNOPSIS
+ pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source]
+ [destination]
+
+ pcretest was written as a test program for the PCRE regular
+ expression library itself, but it can also be used for
+ experimenting with regular expressions. This man page
+ describes the features of the test program; for details of
+ the regular expressions themselves, see the pcre man page.
+
+
+
+OPTIONS
+ -d Behave as if each regex had the /D modifier (see
+ below); the internal form is output after compila-
+ tion.
+
+ -i Behave as if each regex had the /I modifier;
+ information about the compiled pattern is given
+ after compilation.
+
+ -m Output the size of each compiled pattern after it
+ has been compiled. This is equivalent to adding /M
+ to each regular expression. For compatibility with
+ earlier versions of pcretest, -s is a synonym for
+ -m.
+
+ -o osize Set the number of elements in the output vector
+ that is used when calling PCRE to be osize. The
+ default value is 45, which is enough for 14 cap-
+ turing subexpressions. The vector size can be
+ changed for individual matching calls by including
+ \O in the data line (see below).
+
+ -p Behave as if each regex had the /P modifier; the
+ POSIX wrapper API is used to call PCRE. None of the
+ other options has any effect when -p is set.
+
+ -t Run each compile, study, and match 20000 times
+ with a timer, and output the resulting time per compile
+ or match (in milliseconds). Do not set -t
+ with -m, because you will then get the size output
+ 20000 times and the timing will be distorted.
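
The relationship between the -o default of 45 and "14 capturing subexpressions" can be checked with a little arithmetic. The sketch below assumes the pcre_exec() convention of three vector elements per substring, with substring 0 (the whole match) counted as well; that convention comes from the PCRE API documentation rather than from this page, and the function names are made up for illustration.

```python
def max_capturing_groups(osize):
    """Largest number of capturing subexpressions an output vector can report.

    Assumption: three elements are needed per substring, and substring 0
    (the whole match) takes one of those slots.
    """
    return max(osize // 3 - 1, 0)


def ovector_size_for(groups):
    """Smallest vector size that reports `groups` capturing subexpressions."""
    return (groups + 1) * 3


if __name__ == "__main__":
    print(max_capturing_groups(45))
    print(ovector_size_for(14))
```

Under that assumption, a pattern with 20 capturing subexpressions would need -o 63 (or \O63 on the data line) to have every substring reported.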
+
+
+
+DESCRIPTION
+ If pcretest is given two filename arguments, it reads from
+ the first and writes to the second. If it is given only one
+ filename argument, it reads from that file and writes to
+ stdout. Otherwise, it reads from stdin and writes to stdout,
+ and prompts for each line of input, using "re>" to prompt
+ for regular expressions, and "data>" to prompt for data
+ lines.
+
+ The program handles any number of sets of input on a single
+ input file. Each set starts with a regular expression, and
+ continues with any number of data lines to be matched
+ against the pattern. An empty line signals the end of the
+ data lines, at which point a new regular expression is read.
+ The regular expressions are given enclosed in any non-
+ alphameric delimiters other than backslash, for example
+
+ /(a|bc)x+yz/
+
+ White space before the initial delimiter is ignored. A regu-
+ lar expression may be continued over several input lines, in
+ which case the newline characters are included within it. It
+ is possible to include the delimiter within the pattern by
+ escaping it, for example
+
+ /abc\/def/
+
+ If you do so, the escape and the delimiter form part of the
+ pattern, but since delimiters are always non-alphameric,
+ this does not affect its interpretation. If the terminating
+ delimiter is immediately followed by a backslash, for exam-
+ ple,
+
+ /abc/\
+
+ then a backslash is added to the end of the pattern. This is
+ done to provide a way of testing the error condition that
+ arises if a pattern finishes with a backslash, because
+
+ /abc\/
+
+ is interpreted as the first line of a pattern that starts
+ with "abc/", causing pcretest to read the next line as a
+ continuation of the regular expression.
+
+
+
+PATTERN MODIFIERS
+ The pattern may be followed by i, m, s, or x to set the
+ PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
+ options, respectively. For example:
+
+ /caseless/i
+
+ These modifier letters have the same effect as they do in
+ Perl. There are others which set PCRE options that do not
+ correspond to anything in Perl: /A, /E, and /X set
+ PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respec-
+ tively.
+
+ Searching for all possible matches within each subject
+ string can be requested by the /g or /G modifier. After
+ finding a match, PCRE is called again to search the
+ remainder of the subject string. The difference between /g
+ and /G is that the former uses the startoffset argument to
+ pcre_exec() to start searching at a new point within the
+ entire string (which is in effect what Perl does), whereas
+ the latter passes over a shortened substring. This makes a
+ difference to the matching process if the pattern begins
+ with a lookbehind assertion (including \b or \B).
+
+ If any call to pcre_exec() in a /g or /G sequence matches an
+ empty string, the next call is done with the PCRE_NOTEMPTY
+ and PCRE_ANCHORED flags set in order to search for another,
+ non-empty, match at the same point. If this second match
+ fails, the start offset is advanced by one, and the normal
+ match is retried. This imitates the way Perl handles such
+ cases when using the /g modifier or the split() function.
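
The retry rule for empty matches is easier to follow as a loop. The following is only a sketch of the logic described above, written against Python's re module rather than PCRE: re has no PCRE_NOTEMPTY flag, so an anchored attempt that comes back empty is simply treated as a failure, which is an approximation. The function name global_matches is hypothetical.

```python
import re


def global_matches(pattern, subject):
    """Collect successive matches following the /g rule described above."""
    regex = re.compile(pattern)
    results = []
    pos = 0
    retry_nonempty = False  # True immediately after an empty match
    while pos <= len(subject):
        if retry_nonempty:
            # Anchored attempt at the same point; an empty result
            # stands in for "PCRE_NOTEMPTY made the match fail".
            m = regex.match(subject, pos)
            if m is None or m.end() == m.start():
                pos += 1  # advance by one and go back to a normal search
                retry_nonempty = False
                continue
        else:
            # Normal search from a new start offset within the whole
            # string, so lookbehind and \B can still see earlier text.
            m = regex.search(subject, pos)
            if m is None:
                break
        results.append(m.group(0))
        pos = m.end()
        retry_nonempty = m.end() == m.start()
    return results


if __name__ == "__main__":
    print(global_matches(r"\Bi(\w\w)", "Mississippi"))
    print(global_matches(r"a*", "baaab"))
```

Passing the start position to search() instead of slicing the subject corresponds to the /g behaviour; slicing would correspond to /G.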
+
+ There are a number of other modifiers for controlling the
+ way pcretest operates.
+
+ The /+ modifier requests that as well as outputting the sub-
+ string that matched the entire pattern, pcretest should in
+ addition output the remainder of the subject string. This is
+ useful for tests where the subject contains multiple copies
+ of the same substring.
+
+ The /L modifier must be followed directly by the name of a
+ locale, for example,
+
+ /pattern/Lfr
+
+ For this reason, it must be the last modifier letter. The
+ given locale is set, pcre_maketables() is called to build a
+ set of character tables for the locale, and this is then
+ passed to pcre_compile() when compiling the regular expres-
+ sion. Without an /L modifier, NULL is passed as the tables
+ pointer; that is, /L applies only to the expression on which
+ it appears.
+
+ The /I modifier requests that pcretest output information
+ about the compiled expression (whether it is anchored, has a
+ fixed first character, and so on). It does this by calling
+ pcre_fullinfo() after compiling an expression, and outputting
+ the information it gets back. If the pattern is studied, the
+ results of that are also output.

+ The /D modifier is a PCRE debugging feature, which also
+ assumes /I. It causes the internal form of compiled regular
+ expressions to be output after compilation.
+
+ The /S modifier causes pcre_study() to be called after the
+ expression has been compiled, and the results used when the
+ expression is matched.
+
+ The /M modifier causes the size of the memory block used to hold
+ the compiled pattern to be output.
+
+ The /P modifier causes pcretest to call PCRE via the POSIX
+ wrapper API rather than its native API. When this is done,
+ all other modifiers except /i, /m, and /+ are ignored.
+ REG_ICASE is set if /i is present, and REG_NEWLINE is set if
+ /m is present. The wrapper functions force
+ PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
+ REG_NEWLINE is set.
+
+ The /8 modifier causes pcretest to call PCRE with the
+ PCRE_UTF8 option set. This turns on the (currently incom-
+ plete) support for UTF-8 character handling in PCRE, pro-
+ vided that it was compiled with this support enabled. This
+ modifier also causes any non-printing characters in output
+ strings to be printed using the \x{hh...} notation if they
+ are valid UTF-8 sequences.
+
+
+
+DATA LINES
+ Before each data line is passed to pcre_exec(), leading and
+ trailing whitespace is removed, and it is then scanned for \
+ escapes. The following are recognized:
+
+ \a alarm (= BEL)
+ \b backspace
+ \e escape
+ \f formfeed
+ \n newline
+ \r carriage return
+ \t tab
+ \v vertical tab
+ \nnn octal character (up to 3 octal digits)
+ \xhh hexadecimal character (up to 2 hex digits)
+ \x{hh...} hexadecimal UTF-8 character
+
+ \A pass the PCRE_ANCHORED option to pcre_exec()
+ \B pass the PCRE_NOTBOL option to pcre_exec()
+ \Cdd call pcre_copy_substring() for substring dd
+ after a successful match (any decimal number
+ less than 32)
+ \Gdd call pcre_get_substring() for substring dd
+ after a successful match (any decimal number
+ less than 32)
+ \L call pcre_get_substringlist() after a
+ successful match
+ \N pass the PCRE_NOTEMPTY option to pcre_exec()
+ \Odd set the size of the output vector passed to
+ pcre_exec() to dd (any number of decimal
+ digits)
+ \Z pass the PCRE_NOTEOL option to pcre_exec()
+
+ When \O is used, the value may be higher or lower than the size
+ set by the -o option (or defaulted to 45); \O applies only to
+ the call of pcre_exec() for the line in which it appears.
+
+ A backslash followed by anything else just escapes the any-
+ thing else. If the very last character is a backslash, it is
+ ignored. This gives a way of passing an empty line as data,
+ since a real empty line terminates the data input.
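
As a rough illustration of the scanning rules, here is a sketch that handles only the character escapes from the table above. It deliberately leaves out \x{hh...} and the option escapes (\A, \B, \C, \G, \L, \N, \O, \Z); it would treat those letters as plain escaped characters, which pcretest does not do. The function name decode_data_line is hypothetical, 8-bit input is assumed, and treating \x with no digits as a zero byte is also an assumption.

```python
SIMPLE = {"a": 7, "b": 8, "e": 27, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11}
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_data_line(line):
    """Strip surrounding whitespace, then expand backslash escapes."""
    text = line.strip()
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            out += ch.encode("latin-1")  # assumes 8-bit characters
            continue
        if i == len(text):
            break  # a backslash as the very last character is ignored
        esc = text[i]
        i += 1
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
        elif esc in OCTAL_DIGITS:
            digits = esc
            while len(digits) < 3 and i < len(text) and text[i] in OCTAL_DIGITS:
                digits += text[i]
                i += 1
            out.append(int(digits, 8) & 0xFF)
        elif esc == "x":
            digits = ""
            while len(digits) < 2 and i < len(text) and text[i] in HEX_DIGITS:
                digits += text[i]
                i += 1
            out.append(int(digits or "0", 16))  # assumption: bare \x gives NUL
        else:
            out += esc.encode("latin-1")  # anything else escapes itself
    return bytes(out)


if __name__ == "__main__":
    print(decode_data_line("  abc\\tdef\\x41\\101  "))
```

A data line consisting of a single backslash decodes to an empty byte string, which is how an empty subject can be passed without ending the data input.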
+
+ If /P was present on the regex, causing the POSIX wrapper
+ API to be used, only \B and \Z have any effect, causing
+ REG_NOTBOL and REG_NOTEOL to be passed to regexec()
+ respectively.
+
+ The use of \x{hh...} to represent UTF-8 characters is not
+ dependent on the use of the /8 modifier on the pattern. It
+ is recognized always. There may be any number of hexadecimal
+ digits inside the braces. The result is from one to six
+ bytes, encoded according to the UTF-8 rules.
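
The "one to six bytes" figure follows from the original UTF-8 scheme, which covers values up to 31 bits. The sketch below shows that encoding for a single value; it is an illustration of the general UTF-8 rules, not a transcription of pcretest's own code, and the function name is hypothetical.

```python
def encode_utf8_extended(code_point):
    """Encode a value of up to 31 bits using the original 1-6 byte UTF-8 rules."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value does not fit in six UTF-8 bytes")
    if code_point < 0x80:
        return bytes([code_point])  # plain ASCII, one byte
    # Exclusive upper limits for 2-, 3-, 4-, 5- and 6-byte sequences.
    limits = [(0x800, 2), (0x10000, 3), (0x200000, 4),
              (0x4000000, 5), (0x80000000, 6)]
    for limit, length in limits:
        if code_point < limit:
            break
    # Peel off six bits at a time for the continuation bytes (10xxxxxx).
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # Lead byte: `length` one-bits, a zero bit, then the remaining value bits.
    lead_prefix = (0xFF << (8 - length)) & 0xFF
    return bytes([lead_prefix | code_point] + tail[::-1])


if __name__ == "__main__":
    print(encode_utf8_extended(0x20AC).hex())
    print(len(encode_utf8_extended(0x7FFFFFFF)))
```

For values in the modern Unicode range this agrees with Python's built-in str.encode("utf-8"); the five- and six-byte forms exist only in the older definition.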
+
+
+
+OUTPUT FROM PCRETEST
+ When a match succeeds, pcretest outputs the list of captured
+ substrings that pcre_exec() returns, starting with number 0
+ for the string that matched the whole pattern. Here is an
+ example of an interactive pcretest run.
+
+ $ pcretest
+ PCRE version 2.06 08-Jun-1999
+
+ re> /^abc(\d+)/
+ data> abc123
+ 0: abc123
+ 1: 123
+ data> xyz
+ No match
+
+ If the strings contain any non-printing characters, they are
+ output as \0x escapes, or as \x{...} escapes if the /8
+ modifier was present on the pattern. If the pattern has the
+ /+ modifier, then the output for substring 0 is followed by
+ the rest of the subject string, identified by "0+" like
+ this:
+
+ re> /cat/+
+ data> cataract
+ 0: cat
+ 0+ aract
+
+ If the pattern has the /g or /G modifier, the results of
+ successive matching attempts are output in sequence, like
+ this:
+
+ re> /\Bi(\w\w)/g
+ data> Mississippi
+ 0: iss
+ 1: ss
+ 0: iss
+ 1: ss
+ 0: ipp
+ 1: pp
+
+ "No match" is output only if the first match attempt fails.
+
+ If any of the sequences \C, \G, or \L are present in a data
+ line that is successfully matched, the substrings extracted
+ by the convenience functions are output with C, G, or L
+ after the string number instead of a colon. This is in addi-
+ tion to the normal full list. The string length (that is,
+ the return from the extraction function) is given in
+ parentheses after each string for \C and \G.
+
+ Note that while patterns can be continued over several lines
+ (a plain ">" prompt is used for continuations), data lines
+ may not. However, newlines can be included in data by means
+ of the \n escape.
+
+
+
+AUTHOR
+ Philip Hazel <ph10@cam.ac.uk>
+ University Computing Service,
+ New Museums Site,
+ Cambridge CB2 3QG, England.
+ Phone: +44 1223 334714
+
+ Last updated: 15 August 2001
+ Copyright (c) 1997-2001 University of Cambridge.