Data Compression Meets Automata Theory
Date
2024-02-07
Authors
Cotumaccio, Nicola
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
We introduce a new paradigm in graph compression and formal language theory. We show that the ideas behind some of the most important data structures for compressing and indexing strings - such as the suffix array, the Burrows-Wheeler Transform and the FM-index - are much more general and provide a new approach to studying automata and regular languages, which retrospectively explains the impact of these data structures. We classify all automata and all regular languages by their propensity to be sorted. Our classification represents a useful parameterization simultaneously for diverse automata-related measures: (i) the encoding bit-complexity of automata/labeled graphs, (ii) the complexity of operations on regular languages (e.g. membership) and on labeled graphs (e.g. pattern matching), (iii) the complexity of NFA determinization by the powerset-construction algorithm. To the best of our knowledge, ours is the only parameterization of automata/labeled graphs capturing simultaneously all these aspects. We show that our parameterization has deep and unexpected consequences both in data compression (encoding, pattern matching) and in automata theory (nondeterminism, entanglement, minimization).
Description
Keywords
Data Compression, Automata Theory, Burrows-Wheeler Transform, FM-index, Regular Languages, Nondeterminism, Minimization, Suffix Trees, Suffix Arrays