5,200 research outputs found
Regular and almost universal hashing: an efficient implementation
Random hashing can provide guarantees regarding the performance of data
structures such as hash tables---even in an adversarial setting. Many existing
families of hash functions are universal: given two data objects, the
probability that they have the same hash value is low given that we pick hash
functions at random. However, universality fails to ensure that all hash
functions are well behaved. We further require regularity: when picking data
objects at random they should have a low probability of having the same hash
value, for any fixed hash function. We present the efficient implementation of
a family of non-cryptographic hash functions (PM+) offering good running times,
good memory usage as well as distinguishing theoretical guarantees: almost
universality and component-wise regularity. On a variety of platforms, our
implementations are comparable to the state of the art in performance. On
recent Intel processors, PM+ achieves a speed of 4.7 bytes per cycle for 32-bit
outputs and 3.3 bytes per cycle for 64-bit outputs. We review vectorization
through SIMD instructions (e.g., AVX2) and optimizations for superscalar
execution.Comment: accepted for publication in Software: Practice and Experience in
September 201
Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting
of the k elements that are smallest according to a given hash function h. With
this sample we can estimate the relative size f=|Y|/|X| of any subset Y as
|S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard
similarity f=|A intersect B|/|A union B| between sets A and B. Given the
bottom-k samples from A and B, we construct the bottom-k sample of their union
as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is
estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k.
We show here that even if the hash function is only 2-independent, the
expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a
constant factor of the expected relative error with truly random hashing.
For comparison, consider the classic approach of kxmin-wise where we use k
hash independent functions h_1,...,h_k, storing the smallest element with each
hash function. For kxmin-wise there is an at least constant bias with constant
independence, and it is not reduced with larger k. Recently Feigenblat et al.
showed that bottom-k circumvents the bias if the hash function is 8-independent
and k is sufficiently large. We get down to 2-independence for any k. Our
result is based on a simply union bound, transferring generic concentration
bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger
probability error bounds with higher independence.
For weighted sets, we consider priority sampling which adapts efficiently to
the concrete input weights, e.g., benefiting strongly from heavy-tailed input.
This time, the analysis is much more involved, but again we show that generic
concentration bounds can be applied.Comment: A short version appeared at STOC'1
Source File Set Search for Clone-and-Own Reuse Analysis
Clone-and-own approach is a natural way of source code reuse for software
developers. To assess how known bugs and security vulnerabilities of a cloned
component affect an application, developers and security analysts need to
identify an original version of the component and understand how the cloned
component is different from the original one. Although developers may record
the original version information in a version control system and/or directory
names, such information is often either unavailable or incomplete. In this
research, we propose a code search method that takes as input a set of source
files and extracts all the components including similar files from a software
ecosystem (i.e., a collection of existing versions of software packages). Our
method employs an efficient file similarity computation using b-bit minwise
hashing technique. We use an aggregated file similarity for ranking components.
To evaluate the effectiveness of this tool, we analyzed 75 cloned components in
Firefox and Android source code. The tool took about two hours to report the
original components from 10 million files in Debian GNU/Linux packages. Recall
of the top-five components in the extracted lists is 0.907, while recall of a
baseline using SHA-1 file hash is 0.773, according to the ground truth recorded
in the source code repositories.Comment: 14th International Conference on Mining Software Repositorie
A Case-Based Reasoning Method for Locating Evidence During Digital Forensic Device Triage
The role of triage in digital forensics is disputed, with some practitioners questioning its reliability for identifying evidential data. Although successfully implemented in the field of medicine, triage has not established itself to the same degree in digital forensics. This article presents a novel approach to triage for digital forensics. Case-Based Reasoning Forensic Triager (CBR-FT) is a method for collecting and reusing past digital forensic investigation information in order to highlight likely evidential areas on a suspect operating system, thereby helping an investigator to decide where to search for evidence. The CBR-FT framework is discussed and the results of twenty test triage examinations are presented. CBR-FT has been shown to be a more effective method of triage when compared to a practitioner using a leading commercial application
- …