Repository logo

A Kraken-like tool with k given at query time and an index for finding approximately longest common substrings

Loading...
Thumbnail Image

Authors

Kashgouli, Sana

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

For highly repetitive texts such as pangenomic databases, indexes based on grammars (or Lempel-Ziv parses, string attractors, etc.) are usually significantly smaller than indexes based on the Burrows-Wheeler Transform. Nevertheless, they are not widely used in practice, probably because research on grammar-based indexes has focused almost exclusively on exact pattern matching. In the first main part of this thesis, we describe a new tool, KATKA, that stores a phylogenetic tree T such that later, given a pattern P[1..m] and an integer k, it can quickly return the root of the smallest subtree of T containing all the genomes in which the k-mer P[i..i + k – 1] occurs, for 1 ≤ i ≤ m – k + 1. This is similar to the functionality of the popular metagenomic-classification tool KRAKEN but with k given at query time instead of at construction time. In the second main part, we show how, given positive constants ϵ and δ, and an α-balanced straight-line program with g rules for a text T[1..n], we can build an O(g)-space index that, given a pattern P[1..m], in O(mlogδ g) time finds with high probability a substring of P that occurs in T and whose length is at least a (1 – ϵ) fraction of the longest common substring of P and T. The correctness can be ensured within the same query time in expectation. KATKA is currently based on an LZ77-index, but we discuss how we may be able to implement it on top of a grammar-based index, taking advantage of our results in the second part. We close with a brief discussion of future work.

Description

Keywords

Indexing, pattern matching

Citation