Remix.run Logo
ieviev 3 hours ago

We refer to this in the paper as well,

The standard way to do intersection / complementation of regexes with NFAs requires determinization, which causes a huge blowup, whereas for us this is the cost of a derivative.

It is true that we cannot avoid enormous DFA sizes, a simple case would be (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*)... which has 2^4 states and every intersection adds +1 to the exponent.

How we get around this in the real world is that we create at most one state per input character, so even if the full DFA size is 1 million, you need an input that is at least 1 million characters long to reach it.

The real argument to complexity is how expensive can the cost of taking a lazy derivative get? The first time you use the engine with a unique input and states, it is not linear - the worst case is creating a new state for each character. The second time the same (or similar) input is used these states are already created and it is linear. So as said in the article it is a bit foggy - Lazy DFAs are not linear but appear as such for practical cases

btown 3 hours ago | parent [-]

> The second time the same (or similar) input is used these states are already created and it is linear.

Does this imply that the DFA for a regex, as an internal cache, is mutable and persisted between inputs? Could this lead to subtle denial-of-service attacks, where inputs are chosen by an attacker to steadily increase the cached complexity - are there eviction techniques to guard against this? And how might this work in a multi-threaded environment?

ieviev 2 hours ago | parent [-]

Yes, most (i think all) lazy DFA engines have a mutable DFA behind a lock internally that grows during matching.

Multithreading is generally a non-issue, you just wrap the function that creates the state behind a lock/mutex, this is usually the default.

The subtle denial of service part is interesting, i haven't thought of it before. Yes this is possible. For security-critical uses i would compile the full DFA ahead of time - the memory cost may be painful but this completely removes the chance of anything going wrong.

There are valid arguments to switch from DFA to NFA with large state spaces, but RE# intentionally does not switch to a NFA and capitalizes on reducing the DFA memory costs instead (eg. minterm compression in the post, algebraic simplifications in the paper).

The problem with going from DFA to NFA for large state spaces is that this makes the match time performance fall off a cliff - something like going from 1GB/s to 1KB/s as we also show in the benchmarks in the paper.

As for eviction techniques i have not researched this, the simplest thing to do is just completely reset the instance and rebuild past a certain size, but likely there is a better way.