tries: don't allocate memory at runtime

This is an indirect fix for [perl #74484] Regex causing exponential runtime+mem usage The trie runtime code was doing more SAVETMPS than FREETMPS and was thus growing a large tmps stack on heavy backtracking. Rather than fixing this directly, I rewrote part of the trie code so that it no longer needs to allocate memory in S_regmatch (it still does in find_byclass()). The basic issue is that multiple branches in the trie may trigger an accept state; for example: "abcd" =~ /xyz/abcd.*X|ab.*Y|/ here, words (branches) 2 and 3 are accept states. The original approach was, at run time, to create a list of accepted word numbers and the character positions of the end of each of those words. Then run the rest of the pattern for each word in the list in turn (in word index order). This requires memory for the list to be allocated and freed. The new approach involves creating extra info at compile time; in particular, for each word, a pointer to the previous accepted word (if any) in the state tree. For example for the above pattern, part of the state tree may be q b c d 1 -> 2 -> 3 -> 4 -> 5 (#3) (#2) (e.g. at state 1, if the next char is 'a', we transition to state 2). Here, state 3 is an accept state with word #3, and 5 is an accept state with word #2. So we build a table indexed by word number, which has wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0. At run time we run the trie to completion, and remember the word associated with the longest accept state (word #2 above). Then by following back the chain of .prev fields, we can produce a list of all accepting words. We then iteratively find the smallest-numbered (ie LH-most) word in the chain, and run with it. On failure and backtrack, we find the next-smallest and so on. Since we are no longer recording the end-position of each word in the string, we have to recalculate this for each backtrack. We initially record the end-position of the shortest accepting word, and given that we know the length of each word, we can calculate the new position each time as an offset from that first word. Depending on unicode and folding, that calculation can be cheap or expensive. This algorithm is optimised for the typical case where there are a small number (<= 2) accepting states. This patch creates a new compile-time array, trie->wordinfo[], indexed by word number, which contains relevant info about each word. This also supersedes the old trie->newword[] array, whose function of recording "overspills" of multiple words per accept state, is now handled as part of the wordinfo[].prev chain.
author: David Mitchell <davem@iabyn.com> 2010-05-03 13:57:58 +0100
committer: David Mitchell <davem@iabyn.com> 2010-05-03 14:29:54 +0100
commit: 2e64971a6530d2645969bc489f564bfd3ce64993 (patch)
tree: 8b5ab3ff672755f1ca62bd040b2240dbb5719934 /regcomp.h
parent: a23e6e2012eae03dbd049a058d24b0ce29357c76 (diff)
download: perl-2e64971a6530d2645969bc489f564bfd3ce64993.tar.gz
1 files changed, 11 insertions, 3 deletions
diff --git a/regcomp.h b/regcomp.h
index 20b4401ed2..a20d6e11bd 100644
--- a/regcomp.h
+++ b/regcomp.h
@@ -586,6 +586,15 @@ struct _reg_trie_state {
   } trans;
 };
 
+/* info per word; indexed by wordnum */
+typedef struct {
+    U16  prev;	/* previous word in acceptance chain; eg in
+		 * zzz|abc|ab/ after matching the chars abc, the
+		 * accepted word is #2, and the previous accepted
+		 * word is #3 */
+    U32 len;	/* how many chars long is this word? */
+    U32 accept;	/* accept state for this word */
+} reg_trie_wordinfo;
 
 
 typedef struct _reg_trie_state    reg_trie_state;
@@ -603,15 +612,14 @@ struct _reg_trie_data {
     reg_trie_state  *states;         /* state data */
     reg_trie_trans  *trans;          /* array of transition elements */
     char            *bitmap;         /* stclass bitmap */
-    U32             *wordlen;        /* array of lengths of words */
     U16 	    *jump;           /* optional 1 indexed array of offsets before tail 
                                         for the node following a given word. */
-    U16	            *nextword;       /* optional 1 indexed array to support linked list
-                                        of duplicate wordnums */
+    reg_trie_wordinfo *wordinfo;     /* array of info per word */
     U16             uniquecharcount; /* unique chars in trie (width of trans table) */
     U32             startstate;      /* initial state - used for common prefix optimisation */
     STRLEN          minlen;          /* minimum length of words in trie - build/opt only? */
     STRLEN          maxlen;          /* maximum length of words in trie - build/opt only? */
+    U32             prefixlen;       /* #chars in common prefix */
     U32             statecount;      /* Build only - number of states in the states array 
                                         (including the unused zero state) */
     U32             wordcount;       /* Build only */
author	David Mitchell <davem@iabyn.com>	2010-05-03 13:57:58 +0100
committer	David Mitchell <davem@iabyn.com>	2010-05-03 14:29:54 +0100
commit	2e64971a6530d2645969bc489f564bfd3ce64993 (patch)
tree	8b5ab3ff672755f1ca62bd040b2240dbb5719934 /regcomp.h
parent	a23e6e2012eae03dbd049a058d24b0ce29357c76 (diff)
download	perl-2e64971a6530d2645969bc489f564bfd3ce64993.tar.gz