Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are smallest. Various methods have been developed to address this problem, and recently much effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measures, and search schemes in the hash coding space.
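To make the "hash function design" idea concrete, here is a minimal random-hyperplane (SimHash-style) locality sensitive hash for cosine similarity, a sketch rather than any particular method from the survey; the function names and parameters are illustrative.

```python
import random

def make_hyperplane_hash(dim, n_bits, seed=0):
    """Random-hyperplane LSH: each bit records the sign of the dot
    product with one random Gaussian direction, so vectors separated
    by a small angle are likely to share many bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def h(vec):
        code = 0
        for plane in planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            code = (code << 1) | (dot >= 0)
        return code

    return h

h = make_hyperplane_hash(dim=3, n_bits=16)
hamming = lambda a, b: bin(a ^ b).count("1")
# Near-identical vectors tend to collide on most bits; dissimilar ones usually do not.
print(hamming(h([1.0, 0.2, 0.1]), h([0.9, 0.25, 0.05])))
print(hamming(h([1.0, 0.2, 0.1]), h([-1.0, 0.0, 5.0])))
```

Searching then amounts to comparing short binary codes by Hamming distance instead of full vectors by their original metric.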
Siamese coding network and pair similarity prediction for near-duplicate image detection
Near-duplicate detection in a dataset involves finding the elements that are closest to a new query element according to a given similarity function and proximity threshold. The brute-force approach is very computationally intensive, as it evaluates the similarity between the queried item and all items in the dataset. A potential application domain is an image-sharing website that checks for plagiarism or piracy every time a new image is uploaded. Among the various approaches, near-duplicate detection was effectively addressed by SimPair LSH (Fisichella et al., in Decker, Lhotská, Link, Spies, Wagner (eds) Database and expert systems applications, Springer, 2014). As the name suggests, SimPair LSH uses locality sensitive hashing (LSH): it computes and stores in advance a small set of near-duplicate pairs present in the dataset and uses them, via the triangle inequality, to reduce the candidate set returned for a given query. We develop an algorithm that predicts how the candidate set will be reduced. We also develop a new efficient method for near-duplicate image detection using a deep Siamese coding neural network that extracts effective features from images, useful for building LSH indices. Extensive experiments on two benchmark datasets confirm the effectiveness of our deep Siamese coding network and prediction algorithm.
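The triangle-inequality pruning step can be sketched as follows. This is a simplification in the spirit of SimPair LSH, not the paper's exact procedure; the data layout (`candidates`, `near_pairs`) and the use of Euclidean distance are assumptions.

```python
import math

def prune_candidates(query, candidates, near_pairs, threshold):
    """`candidates` is a list of (id, vector) pairs returned by an LSH
    index; `near_pairs` maps an id x to precomputed (y, d(x, y))
    near-duplicate pairs.  Once d(q, x) is known, the triangle
    inequality gives d(q, y) >= d(q, x) - d(x, y), so y can be
    discarded without a distance computation whenever that lower
    bound exceeds the threshold."""
    points = dict(candidates)
    results, pruned = [], set()
    for cid, vec in candidates:
        if cid in pruned:
            continue
        d = math.dist(query, vec)
        if d <= threshold:
            results.append((cid, d))
        for other, pair_d in near_pairs.get(cid, []):
            # Lower bound on d(query, other) via the triangle inequality.
            if other in points and d - pair_d > threshold:
                pruned.add(other)
    return results

cands = [("a", [0.0, 0.0]), ("b", [10.0, 0.0]), ("c", [10.1, 0.0])]
pairs = {"b": [("c", 0.1)]}
# "c" is pruned using the stored pair ("b", "c") without computing d(q, c).
print(prune_candidates([0.0, 0.0], cands, pairs, threshold=1.0))
```

The pruning is safe because any discarded item provably lies beyond the threshold; the savings grow with the number of stored near-duplicate pairs.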
Theory and applications of hashing: report from Dagstuhl Seminar 17181
This report documents the program and the topics discussed at the 4-day Dagstuhl Seminar 17181 “Theory and Applications of Hashing”, which took place May 1–5, 2017. Four long and eighteen short talks covered a wide and diverse range of topics within the theme of the workshop. The program left sufficient space for informal discussions among the 40 participants.
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large-scale computational
platforms to meet both the memory and computational requirements. These
applications differ from the scientific simulations that dominate the workload
on high-end parallel systems today, and they place different requirements on
programming support, software libraries, and parallel architectural design. For
example, they involve irregular communication patterns such as asynchronous
updates to shared data structures. We consider several problems in
high-performance genomics analysis, including alignment, profiling, clustering,
and assembly for both single genomes and metagenomes. We identify some of the
common computational patterns, or motifs, that help inform parallelization
strategies, and compare our motifs to some of the established lists, arguing
that at least two key patterns, sorting and hashing, are missing.
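A canonical instance of the hashing motif in genomics is k-mer counting with a hash table, the building block of profiling and assembly. The following is a minimal serial illustration, not code from the paper; parallel versions shard k-mers across processors by hash value, which is exactly the irregular, asynchronous update pattern described above.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every length-k substring (k-mer) of a DNA sequence using
    a hash table (Counter).  A sequence of length n yields n - k + 1
    overlapping k-mers."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

print(count_kmers("ACGTACGT", 3))
# The 3-mers ACG and CGT each occur twice in this toy sequence.
```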
Data Fingerprinting -- Identifying Files and Tables with Hashing Schemes
Master's thesis in Computer Science. INTRODUCTION: Although hash functions are nothing new, they are not limited
to cryptographic purposes. One important field is data fingerprinting. Here,
the purpose is to generate a digest which serves as a fingerprint (or a license plate)
that uniquely identifies a file. More recently, fuzzy fingerprinting schemes, which
scrap the avalanche effect in favour of detecting local changes, have hit the
spotlight. The main purpose of this project is to find ways to classify text tables,
and to discover where potential changes or inconsistencies have happened.
METHODS: Large parts of this report can be considered applied discrete mathematics;
finite fields and combinatorics have played an important part. Rabin's
fingerprinting scheme was tested extensively and compared against existing
cryptographic algorithms, CRC and FNV. Moreover, a self-designed fuzzy hashing
algorithm with the preliminary name No-Frills Hash (NFHash) has been created and tested
against Nilsimsa and Spamsum. NFHash is based on Mersenne primes and uses a
sliding window to create a fuzzy hash. Furthermore, the usefulness of lookup tables
(with partial seeds) was also explored. The fuzzy hashing algorithm has also been
combined with a k-NN classifier to get an overview of its ability to classify files.
In addition to NFHash, Bloom filters combined with Merkle trees have been the
most important part of this report. This combination allows a user to see where
a change was made, despite the fact that hash functions are one-way. Large parts of
this project have dealt with the study of other open-source libraries and applications,
such as Cassandra and SSDeep, as well as how Bitcoin works. Optimizations have
played a crucial role as well; different approaches to a problem might lead to the
same solution, but resource consumption can be very different.
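Of the non-cryptographic hashes benchmarked above, FNV is compact enough to state in a few lines. This is the standard 64-bit FNV-1a definition (offset basis and prime from the FNV reference), not the thesis's own code:

```python
FNV64_OFFSET = 0xcbf29ce484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001b3        # standard 64-bit FNV prime

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by
    the FNV prime modulo 2**64.  Fast and simple, but non-cryptographic,
    which is why the thesis compares it against stronger digests."""
    h = FNV64_OFFSET
    for b in data:
        h ^= b
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h
```

Unlike a fuzzy scheme such as NFHash or Spamsum, FNV-1a keeps the avalanche property: a one-byte change scrambles the whole digest, so it identifies files exactly rather than approximately.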
RESULTS: The results have shown that the Merkle tree-based approach can track
changes to a table very quickly and efficiently, since it is conservative with
CPU resources. Moreover, the self-designed algorithm NFHash also does
well in terms of file classification when it is coupled with a k-NN classifier.
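The Merkle-tree change tracking described above can be sketched as follows. This is a minimal illustration of the tree part only (the thesis additionally combines it with Bloom filters); all names are illustrative, and both tables are assumed to have the same number of rows.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(rows):
    """Build a Merkle tree over table rows.  Returns a list of levels:
    levels[0] holds the leaf hashes, levels[-1] the single root hash.
    An odd-sized level duplicates its last node before pairing."""
    level = [_h(r.encode()) for r in rows]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def changed_rows(t1, t2):
    """Descend from the roots, visiting only subtrees whose hashes
    differ; returns the indices of changed leaves (rows).  This is why
    the approach is cheap: unchanged subtrees are skipped wholesale."""
    if t1[-1] == t2[-1]:
        return []
    changed = []
    stack = [(len(t1) - 1, 0)]  # (level index, node index), root on top
    while stack:
        lvl, idx = stack.pop()
        if t1[lvl][idx] == t2[lvl][idx]:
            continue
        if lvl == 0:
            changed.append(idx)
        else:
            for child in (2 * idx, 2 * idx + 1):
                if child < len(t1[lvl - 1]):
                    stack.append((lvl - 1, child))
    return sorted(changed)

t1 = build_merkle(["row a", "row b", "row c", "row d"])
t2 = build_merkle(["row a", "row b", "row X", "row d"])
print(changed_rows(t1, t2))  # locates the edited row without rescanning all rows
```

Even though SHA-256 is one-way, comparing the two trees top-down localizes the edit, which is the property the report exploits.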
CONCLUSION: Hash functions refer to a very diverse set of algorithms, not
just algorithms that serve a limited purpose. Fuzzy fingerprinting schemes can still
be considered to be in their infancy, but a lot has happened in the last ten
years. This project has introduced two new ways to create and compare hashes of
similar, yet not necessarily identical, files, or to detect if (and
to what extent) a file was changed. Note that the algorithms presented here should
be considered prototypes, and might still need some large-scale testing to sort out
potential flaws.