Remix.run Logo
masfuerte 5 hours ago

As originally written, doesn't it go from the start of the first match to the end of the last match? I feel like I'm missing something.

ieviev 4 hours ago | parent | next [-]

It goes from start of the first match to the longest "alive" end, in practice it will go to a dead state and return after finding the match end.

there's an implicit `.*` in front of the first pass but i felt it would've been a long tangent so i didn't want to get into it.

so given input 'aabbcc' and pattern `b+`,

first reverse pass (using `.*b+`) marks 'aa|b|bcc'<-

the forward pass starts from the first match:

'aa->b|b|cc' marking 2 ends

then enters a dead state after the first 'c' and returns the longest end: aa|bb|cc

i hope this explains it better

masfuerte 3 hours ago | parent [-]

Cheers. I was more confused by how you were doing multiple matches. So I read the paper, which describes the AllEnds algorithm. If I understand correctly, the reverse pass captures all of the match starts and these need to be remembered for the forward pass. Which is what you were showing above, but I didn't follow it.

So, once it gets going, a traditional engine can produce matches iteratively with no further allocation, but RE# requires allocation proportional to the total number of matches. And in return, it's very much faster and much easier to use (with intersection and complement).

ieviev 3 hours ago | parent [-]

Yes, exactly correct

It's also beneficial to merge some of the matching locations into ranges where possible, so when `a*` matches a long sequence of '|a|a|a|a|a|', it can be represented as a range of (0,5), we do this to keep the lookaround internal states smaller in the engine.

mananaysiempre 3 hours ago | parent | prev [-]

Right, the explanation seems to be a bit oversimplified, but I don’t think it’s difficult to fix it up: you need to collect non-overlapping starts (with an RTL scan) and ends (with an LTR scan) and zip them together. The non-overlapping matches are the last ones you see before you need to reset the matcher (traverse a failing edge). This feels like it should work.

(I tried to write some pseudocode here but got annoyed dealing with edge cases like zero-length matches at EOF, sorry.)