Batched $k$-mer lookup on the Spectral Burrows-Wheeler Transform

Abstract

Since their emergence some two decades ago, indexes based on the Burrows-Wheeler transform (BWT) have been intensely studied and today find wide use in genomics, where they form the basis of software tools for read alignment and $k$-mer lookup—routine tasks in modern data-intensive bioinformatics pipelines. BWT-based indexes reduce an existential query for a pattern $P$ of length $m$ to a sequence of up to $m$ pairs of rank queries on a sequence derived from the underlying indexed data. In general these rank queries exhibit poor locality of memory reference, with each pair causing one or two cache misses, something that has become generally accepted as a limitation of these indexes. However, in the above mentioned applications a typical experimental run will search for 100s of millions—even billions—of patterns using the index. In this paper we show that, taken across such a large set of patterns, rank queries do exhibit locality of memory reference a d that this can be exploited by reorganising the order in which rank queries are issued. We show this leads to significant performance gains—in particular, $k$-mer lookup queries can be answered several times faster when a batch of patterns is treated holistically.

Publication
2025 Proceedings of the Symposium on Algorithm Engineering and Experiments
Date
Links