A Kraken-like tool with k given at query time and an index for finding approximately longest common substrings
Date
2023-08-23
Authors
Kashgouli, Sana
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
For highly repetitive texts such as pangenomic databases, indexes based on grammars (or Lempel-Ziv parses, string attractors, etc.) are usually significantly smaller than indexes based on the Burrows-Wheeler Transform. Nevertheless, they are not widely used in practice, probably because research on grammar-based indexes has focused almost exclusively on exact pattern matching. In the first main part of this thesis, we describe a new tool, KATKA, that stores a phylogenetic tree T such that later, given a pattern P[1..m] and an integer k, it can quickly return the root of the smallest subtree of T containing all the genomes in which the k-mer P[i..i + k – 1] occurs, for 1 ≤ i ≤ m – k + 1. This is similar to the functionality of the popular metagenomic-classification tool KRAKEN but with k given at query time instead of at construction time. In the second main part, we show how, given positive constants ϵ and δ, and an α-balanced straight-line program with g rules for a text T[1..n], we can build an O(g)-space index that, given a pattern P[1..m], in O(mlogδ g) time finds with high probability a substring of P that occurs in T and whose length is at least a (1 – ϵ) fraction of the longest common substring of P and T. The correctness can be ensured within the same query time in expectation. KATKA is currently based on an LZ77-index, but we discuss how we may be able to implement it on top of a grammar-based index, taking advantage of our results in the second part. We close with a brief discussion of future work.
Description
Keywords
Indexing, pattern matching