The automaton isn't even that big, really. The number of NFA states is roughly proportional to the total number of characters in the regular expressions involved, and NFA-to-DFA conversion usually expands the automaton only by a factor of two or three. Although the subset construction is worst-case exponential in theory, that situation almost never occurs in practice.
Yes, but unless you're on embedded hardware (and to be honest, to some extent even then, unless you're at the very bottom of the power scale), you're running on what amounts to a supercomputer that will laugh at 200,000 nodes in a DFA. And it's very easy for that single DFA to run significantly faster than the presumably tens of thousands of automata you'd otherwise be running sequentially.
The exponential blowup only occurs in the worst case. I don't have a proof at hand but I suspect that the final automaton shouldn't grow very large if your regexes are modeling web routes, since they are structured as a tree of paths with lots of "prefix-sharing" and no loops.
Other than that, if you are concerned about automaton size, you can use a non-deterministic automaton instead of a deterministic one. You spend a bit more time per input character (you have to update a set of "active" states, whereas in the DFA case there is always exactly one active state), but in exchange the automaton itself is much more compact.
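To make the trade-off concrete, here's a minimal sketch of that NFA simulation in C. The NFA itself is hand-built for the regex "ab|ac" purely for illustration (the states, transitions, and names are my assumptions, not anything from this thread); the point is that the matcher tracks a *set* of active states rather than a single one:

```c
#include <stdbool.h>
#include <string.h>

#define NSTATES 4

typedef struct { unsigned char ch; int from, to; } Trans;

/* Hypothetical hand-built NFA for "ab|ac". The two 'a' transitions
   out of state 0 are what make it nondeterministic; state 3 accepts. */
static const Trans nfa[] = {
    {'a', 0, 1}, {'a', 0, 2}, {'b', 1, 3}, {'c', 2, 3},
};
static const int ntrans = sizeof nfa / sizeof nfa[0];

/* Simulate the NFA directly: keep a boolean set of active states and
   advance all of them in lockstep for each input byte. Cost per byte
   is proportional to the number of active states, but there is no
   subset construction and therefore no risk of DFA blowup. */
bool nfa_match(const char *s)
{
    bool cur[NSTATES] = { [0] = true };   /* start in state 0 */
    for (; *s; s++) {
        bool next[NSTATES] = { false };
        for (int t = 0; t < ntrans; t++)
            if (cur[nfa[t].from] && nfa[t].ch == (unsigned char)*s)
                next[nfa[t].to] = true;
        memcpy(cur, next, sizeof cur);
    }
    return cur[3];  /* accept iff state 3 is active at end of input */
}
```

A real implementation would also handle epsilon transitions and use sparse sets instead of a boolean array, but the "list of active states" loop is the same idea.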
By that logic the app I work on would have a DFA with ~200,000 nodes. That sounds fairly large to me.
That's not very large, since you can store a DFA very compactly as a table (a C array of structs). The table stores the transitions, each of which needs only a character and a target state. Then you either keep a separate array mapping each state to its offset in the transition table, or you rewrite the transitions so that each one points directly at the first transition of its target state.
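A minimal sketch of the offset-array variant, again using a toy DFA hand-built for "ab|ac" (the states and layout here are illustrative assumptions, not a real router's tables):

```c
#include <stdbool.h>

/* One struct per transition: just the character and the target state. */
typedef struct { unsigned char ch; int next; } Edge;

/* Toy DFA for "ab|ac":
   state 0 --a--> 1, state 1 --b--> 2, state 1 --c--> 2; state 2 accepts. */
static const Edge edges[] = {
    {'a', 1},            /* transitions out of state 0 */
    {'b', 2}, {'c', 2},  /* transitions out of state 1 */
                         /* state 2 has no outgoing transitions */
};

/* offsets[s] .. offsets[s+1] is the slice of `edges` belonging to
   state s, so the whole DFA is two flat arrays. */
static const int offsets[] = { 0, 1, 3, 3 };

#define ACCEPT_STATE 2

bool dfa_match(const char *s)
{
    int state = 0;
    for (; *s; s++) {
        int next = -1;
        for (int e = offsets[state]; e < offsets[state + 1]; e++)
            if (edges[e].ch == (unsigned char)*s) { next = edges[e].next; break; }
        if (next < 0)
            return false;   /* no transition on this byte: dead end */
        state = next;
    }
    return state == ACCEPT_STATE;
}
```

At a few bytes per transition, even a 200,000-node DFA of this shape fits comfortably in a few megabytes, which is the point being made above.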
So, in terms of the GP's post, yes, this is an O(1) (with respect to number of routes) solution.