A practical index for approximate dictionary matching with few mismatches
Approximate dictionary matching is a classic string matching problem
(checking if a query string occurs in a collection of strings) with
applications in, e.g., spellchecking, online catalogs, geolocation, and web
search engines. We present a surprisingly simple solution called a split index,
which is based on the Dirichlet (pigeonhole) principle, for matching a keyword with few
mismatches, and experimentally show that it offers competitive space-time
tradeoffs. Our implementation in the C++ language is focused mostly on data
compaction, which is beneficial for the search speed (e.g., by being cache
friendly). We compare our solution with other algorithms and we show that it
performs better for the Hamming distance. Query times on the order of 1
microsecond were reported for one mismatch for a dictionary size of a few
megabytes on a medium-end PC. We also demonstrate that a basic compression
technique consisting of q-gram substitution can significantly reduce the
index size (up to 50% of the input text size for DNA), while still keeping
the query time relatively low.
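The pigeonhole idea behind the split index can be illustrated with a minimal Python sketch for a single mismatch. This is not the authors' C++ implementation and omits all of its data compaction; the function names and the hash-table layout are assumptions made for illustration. Each word is split into two halves: if a dictionary word differs from the query in at most one position, at least one half must match exactly, so exact lookups on the halves yield candidates that are then verified.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_split_index(words):
    """Index every word under its left half and its right half.

    Keys carry the side and the full word length, so that only words of
    the query's length (a requirement for Hamming distance) are retrieved.
    """
    index = defaultdict(list)
    for w in words:
        mid = len(w) // 2
        index[("L", len(w), w[:mid])].append(w)
        index[("R", len(w), w[mid:])].append(w)
    return index


def find_with_one_mismatch(index, query):
    """Return all indexed words within Hamming distance 1 of the query."""
    mid = len(query) // 2
    # Pigeonhole: one mismatch can spoil only one of the two halves.
    candidates = set(index.get(("L", len(query), query[:mid]), []))
    candidates |= set(index.get(("R", len(query), query[mid:]), []))
    # Verification step: discard candidates with more than one mismatch.
    return sorted(w for w in candidates if hamming(w, query) <= 1)


idx = build_split_index(["carton", "cartel", "carbon", "garden"])
print(find_with_one_mismatch(idx, "carton"))
```

Here "cartel" shares its left half with the query and is therefore retrieved as a candidate, but verification rejects it because it has two mismatches. Handling more mismatches would require splitting into more parts, which this sketch does not attempt.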
Bitwise Source Separation on Hashed Spectra: An Efficient Posterior Estimation Scheme Using Partial Rank Order Metrics
This paper proposes an efficient bitwise solution to the single-channel
source separation task. Most dictionary-based source separation algorithms rely
on iterative update rules at run time, which become computationally
costly, especially when we employ an overcomplete dictionary and sparse encoding,
which tend to give better separation results. To avoid this cost, we propose a
bitwise scheme on hashed spectra that leads to an efficient posterior
probability calculation. For each source, the algorithm uses a partial rank
order metric to extract robust features that form a binarized dictionary of
hashed spectra. Then, for a mixture spectrum, its hash code is compared with
each source's hashed dictionary in one pass. This simple voting-based
dictionary search allows a fast and iteration-free estimation of ratio masking
at each bin of a signal spectrogram. We verify that the proposed BitWise Source
Separation (BWSS) algorithm produces sensible source separation results for the
single-channel speech denoising task, with 6-8 dB mean SDR. To our knowledge,
this is the first dictionary-based algorithm for this task that is completely
iteration-free in both training and testing.
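As a rough illustration of hashing a spectrum with a partial rank order metric, the Python sketch below uses a winner-take-all style hash: for each of several fixed random permutations, it looks at the first k permuted entries and records the position of the largest one. This choice of hash, the parameter names, and the match-count comparison are assumptions for illustration; the abstract does not specify these details, and the sketch stops short of the voting and ratio-mask estimation steps.

```python
import random


def make_permutations(dim, n_perms, seed=0):
    """Fixed random permutations of the bin indices (hypothetical helper)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_perms):
        p = list(range(dim))
        rng.shuffle(p)
        perms.append(p)
    return perms


def rank_order_hash(spectrum, perms, k):
    """For each permutation, store which of the first k permuted bins is largest.

    Only the relative order of magnitudes matters, so the code does not
    change when the whole spectrum is scaled by a positive gain.
    """
    code = []
    for p in perms:
        window = [spectrum[i] for i in p[:k]]
        code.append(max(range(k), key=window.__getitem__))
    return code


def match_count(code_a, code_b):
    """Number of agreeing code positions; higher means more similar."""
    return sum(a == b for a, b in zip(code_a, code_b))


perms = make_permutations(dim=8, n_perms=16)
frame = [0.1, 0.9, 0.3, 0.5, 0.2, 0.8, 0.4, 0.7]
louder = [2.0 * v for v in frame]
code = rank_order_hash(frame, perms, k=4)
print(match_count(code, rank_order_hash(louder, perms, k=4)))
```

A dictionary search in this style would compare the code of a mixture frame against the stored codes of each source in a single pass over the dictionary, with no iterative updates.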
On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching
We present parallel algorithms for exact and approximate pattern matching
with suffix arrays, using a CREW-PRAM with processors. Given a static text
of length , we first show how to compute the suffix array interval of a
given pattern of length in
time for . For approximate pattern matching with differences or
mismatches, we show how to compute all occurrences of a given pattern in
time, where is the size of the alphabet
and . The workhorse of our algorithms is a data structure
for merging suffix array intervals quickly: Given the suffix array intervals
for two patterns and , we present a data structure for computing the
interval of in sequential time, or in
parallel time. All our data structures are of size bits (in addition to
the suffix array).
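For readers unfamiliar with the term, the suffix array interval of a pattern is the contiguous range of suffix array entries whose suffixes start with that pattern. The Python sketch below computes it with two plain sequential binary searches. It is a textbook baseline only, not the parallel algorithm or the interval-merging data structure from the abstract; the naive suffix array construction and all function names are assumptions for illustration.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_array_interval(text, sa, pattern):
    """Return the half-open range [start, end) of sa entries prefixed by pattern.

    Truncating each suffix to len(pattern) characters keeps the sorted
    order, so lower and upper bounds can be found by binary search.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "banana"
sa = build_suffix_array(text)
start, end = suffix_array_interval(text, sa, "ana")
print(sorted(sa[start:end]))  # start positions of "ana" in the text
```

An empty interval (start equal to end) means the pattern does not occur in the text.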