On Indexing and Compressing Finite Automata (2007.07718v1)
Abstract: An index for a finite automaton is a powerful data structure that supports locating paths labeled with a query pattern, thus solving pattern matching on the underlying regular language. In this paper, we solve the long-standing problem of indexing arbitrary finite automata. Our solution consists in finding a partial co-lexicographic order of the states and proving, as in the total order case, that states reached by a given string form one interval on the partial order, thus enabling indexing. We provide a lower bound stating that such an interval requires $O(p)$ words to be represented, $p$ being the order's width (i.e. the size of its largest antichain). Indeed, we show that $p$ determines the complexity of several fundamental problems on finite automata: (i) Letting $\sigma$ be the alphabet size, we provide an encoding for NFAs using $\lceil\log \sigma\rceil + 2\lceil\log p\rceil + 2$ bits per transition and a smaller encoding for DFAs using $\lceil\log \sigma\rceil + \lceil\log p\rceil + 2$ bits per transition. This is achieved by generalizing the Burrows-Wheeler transform to arbitrary automata. (ii) We show that indexed pattern matching can be solved in $\tilde O(m\cdot p2)$ query time on NFAs. (iii) We provide a polynomial-time algorithm to index DFAs, while matching the optimal value for $ p $. On the other hand, we prove that the problem is NP-hard on NFAs. (iv) We show that, in the worst case, the classic powerset construction algorithm for NFA determinization generates an equivalent DFA of size $2p(n-p+1)-1$, where $n$ is the number of NFA's states.
Collections
Sign up for free to add this paper to one or more collections.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.