| ▲ | someplaceguy 4 hours ago | ||||||||||||||||
> the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression The authors seem to claim linear complexity: > the result is RE#, the first general-purpose regex engine to support intersection and complement with linear-time guarantees, and also the overall fastest regex engine on a large set of benchmarks | |||||||||||||||||
| ▲ | ieviev 3 hours ago | parent | next [-] | ||||||||||||||||
We refer to this in the paper as well, The standard way to do intersection / complementation of regexes with NFAs requires determinization, which causes a huge blowup, whereas for us this is the cost of a derivative. It is true that we cannot avoid enormous DFA sizes, a simple case would be (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*)... which has 2^4 states and every intersection adds +1 to the exponent. How we get around this in the real world is that we create at most one state per input character, so even if the full DFA size is 1 million, you need an input that is at least 1 million characters long to reach it. The real argument to complexity is how expensive can the cost of taking a lazy derivative get? The first time you use the engine with a unique input and states, it is not linear - the worst case is creating a new state for each character. The second time the same (or similar) input is used these states are already created and it is linear. So as said in the article it is a bit foggy - Lazy DFAs are not linear but appear as such for practical cases | |||||||||||||||||
| |||||||||||||||||
| ▲ | mananaysiempre 3 hours ago | parent | prev [-] | ||||||||||||||||
These claims are compatible. For instance, lex, re2c, ragel, etc. are exponential in the length of the automaton description, but the resulting lexers work in linear time[1] in the length of the string. Here the situation is even better, because the DFA is constructed lazily, at most one state per input character, so to observe its full enormity relative to the size of the needle you need an equally enormous haystack. “One state per input character” somewhat understates things, because producing each state requires a non-constant[2] amount of work, and storing the new derivative’s syntax tree a non-constant amount of space; but with hash-consing and memoization it’s not bad. Either way, derivative-based matching with a lazy DFA takes something like O(needle + f(needle) × haystack) time where I’m guessing f(n) has to be O(n log n) at least (to bring large ORs into normal form by sorting) but in practice is closer to constant. Space consumption is less of an issue because if at any point your cache of derivatives (= DFA) gets too bloated you can flush it and restart from scratch. [1] Kind of, unless you hit ambiguities that need to be resolved with the maximal munch rule; anyways that’s irrelevant to a single-RE matcher. [2] In particular, introductions to Brzozowski’s approach usually omit—but his original paper does mention—that you need to do some degree of syntax-tree simplification for the derivatives to stay bounded in size (thus finite in number) and the matcher to stay linear in the haystack. | |||||||||||||||||