Nanopore sequencers generate electrical raw signals in real time while
sequencing long genomic strands. These raw signals can be analyzed as they are
generated, providing an opportunity for real-time genome analysis. An important
feature of nanopore sequencing, Read Until, can eject strands from sequencers
without fully sequencing them, which provides opportunities to computationally
reduce the sequencing time and cost. However, existing works that use Read
Until either 1) require powerful computational resources that may not be
available for portable sequencers or 2) lack scalability for large genomes,
rendering them inaccurate or ineffective.
We propose RawHash, the first mechanism that can accurately and efficiently
perform real-time analysis of nanopore raw signals for large genomes using a
hash-based similarity search. To enable this, RawHash ensures that signals
corresponding to the same DNA content lead to the same hash value despite
slight variations in these signals. It does so through an effective
quantization of the raw signals, such that signals corresponding to the same
DNA content have the same quantized value and, subsequently, the same hash
value.
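
The idea of quantizing before hashing can be illustrated with a minimal
sketch. It is not RawHash's actual implementation: the function names
(quantize, hash_events), the bucket range of [-3, 3], the 4-bit bucket width,
and the window of 3 events are all assumptions chosen for illustration, and the
input values are made-up normalized signal levels.

```python
from typing import List


def quantize(event: float, lo: float = -3.0, hi: float = 3.0, bits: int = 4) -> int:
    """Map a normalized signal value to one of 2**bits equal-width buckets.

    The range [lo, hi] and the bit width are illustrative assumptions.
    """
    levels = 1 << bits
    clipped = min(max(event, lo), hi)          # clamp outliers into the range
    bucket = int((clipped - lo) / (hi - lo) * levels)
    return min(bucket, levels - 1)             # keep hi itself in the top bucket


def hash_events(events: List[float], k: int = 3, bits: int = 4) -> List[int]:
    """Pack every window of k consecutive quantized values into one integer."""
    q = [quantize(e, bits=bits) for e in events]
    hashes = []
    for i in range(len(q) - k + 1):
        h = 0
        for v in q[i:i + k]:
            h = (h << bits) | v                # append the bucket index to the key
        hashes.append(h)
    return hashes


# Two hypothetical signals for the same DNA content, differing by small noise.
reference = [0.52, -1.10, 1.70, 0.05, -0.60]
noisy_read = [0.55, -1.07, 1.66, 0.09, -0.58]

ref_hashes = hash_events(reference)
read_hashes = hash_events(noisy_read)

# Exact hash matches can seed a similarity search despite the noise.
shared = set(ref_hashes) & set(read_hashes)
print(ref_hashes == read_hashes, len(shared))
```

In this sketch, small perturbations stay inside the same bucket, so both
signals yield identical hash values and can be matched by exact lookup. Values
that fall near a bucket boundary can still land in different buckets, which is
why the choice of quantization is central to accuracy.
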
We evaluate RawHash on three applications: 1) read mapping, 2) relative
abundance estimation, and 3) contamination analysis. Our evaluations show that
RawHash is the only tool that can provide high accuracy and high throughput for
analyzing large genomes in real time. Compared to the state-of-the-art
techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better
average throughput and 2) an average speedup of 32.1x and 2.1x in the mapping
time, respectively.
Source code is available at https://github.com/CMU-SAFARI/RawHash