Efficient Computation of Sequence Mappability

(1807.11702)
Published Jul 31, 2018 in cs.DS

Abstract

In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, the goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that the length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for $k=\mathcal{O}(1)$, works in $\mathcal{O}(n)$ space and, with high probability, in $\mathcal{O}(n \cdot \min\{m^k,\log^k n\})$ time. Our algorithm requires a careful adaptation of the $k$-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop $\mathcal{O}(n^2)$-time algorithms to compute all $(k,m)$-mappability tables for a fixed $m$ and all $k\in\{0,\ldots,m\}$ or a fixed $k$ and all $m\in\{k,\ldots,n\}$. Finally, we show that, for $k,m = \Theta(\log n)$, the $(k,m)$-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper that was presented at SPIRE 2018.
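
To make the problem definition concrete, the following is a minimal brute-force sketch in Python. It is not the paper's algorithm: it compares every pair of length-$m$ substrings directly and therefore takes $\mathcal{O}(n^2 m)$ time, which is the baseline that the errata-tree approach improves on. The function names `hamming_at_most` and `naive_mappability` are illustrative assumptions, and positions are 0-indexed here.

```python
def hamming_at_most(a: str, b: str, k: int) -> bool:
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False  # stop early once the budget is exceeded
    return True


def naive_mappability(text: str, m: int, k: int) -> list[int]:
    """Brute-force (k, m)-mappability table (hypothetical helper, not from the paper).

    Entry i counts the start positions j != i whose length-m substring
    is within Hamming distance k of the length-m substring starting at i.
    """
    starts = len(text) - m + 1  # number of length-m substrings
    if m <= 0 or starts <= 0:
        return []
    table = [0] * starts
    for i in range(starts):
        for j in range(i + 1, starts):
            # The relation is symmetric, so each matching pair counts for both i and j.
            if hamming_at_most(text[i:i + m], text[j:j + m], k):
                table[i] += 1
                table[j] += 1
    return table
```

For instance, `naive_mappability("abab", 2, 0)` compares the substrings `ab`, `ba`, `ab` and yields `[1, 0, 1]`, since only the two copies of `ab` match each other exactly. A sketch like this is mainly useful as a reference oracle for checking a faster implementation on small inputs.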
