Fast and space-efficient taxonomic classification of long reads with hierarchical interleaved XOR filters [METHODS]

Jens-Uwe Ulrich1,2,3 and Bernhard Y. Renard1 1Data Analytics and Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany; 2Phylogenomics Unit, Center for Artificial Intelligence in Public Health Research, Robert Koch Institute, 15745 Wildau, Germany; 3Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany Corresponding authors: jens-uwe.ulrichhpi.de, bernhard.renardhpi.de Abstract

Metagenomic long-read sequencing is gaining popularity for various applications, including pathogen detection and microbiome studies. To analyze the large data created in those studies, software tools need to taxonomically classify the sequenced molecules and estimate the relative abundances of organisms in the sequenced sample. Because of the exponential growth of reference genome databases, the current taxonomic classification methods have large computational requirements. This issue motivated us to develop a new data structure for fast and memory-efficient querying of long reads. Here, we present Taxor as a new tool for long-read metagenomic classification using a hierarchical interleaved XOR filter data structure for indexing and querying large reference genome sets. Taxor implements several k-mer-based approaches, such as syncmers, for pseudoalignment to classify reads and an expectation-maximization algorithm for metagenomic profiling. Our results show that Taxor outperforms state-of-the-art tools regarding precision while having a similar recall for long-read taxonomic classification. Most notably, Taxor reduces the memory requirements and index size by >50% and is among the fastest tools regarding query times. This enables real-time metagenomics analysis with large reference databases on a small laptop in the field.

Received October 10, 2023. Accepted May 23, 2024.

Comments (0)

No login
gif