Learning to Hash for Indexing Big Data - A Survey
The explosive growth in big data has attracted much attention in designing
efficient indexing and search methods recently. In many critical applications
such as large-scale search and pattern matching, finding the nearest neighbors
to a query is a fundamental research problem. However, the straightforward
solution using exhaustive comparison is infeasible due to the prohibitive
computational complexity and memory requirement. In response, Approximate
Nearest Neighbor (ANN) search based on hashing techniques has become popular
due to its promising performance in both efficiency and accuracy. Prior
randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore
data-independent hash functions with random projections or permutations.
Although they enjoy elegant theoretical guarantees on search quality in certain metric spaces, randomized hashing methods have shown insufficient performance in many real-world applications. As a remedy, new approaches that incorporate data-driven learning methods in the development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data
distributions or class labels when optimizing the hash codes or functions.
Importantly, the learned hash codes are able to preserve, in the hash code space, the proximity of neighboring data in the original feature space. The goal of this paper is to provide readers with a systematic understanding of the insights, pros, and cons of the emerging techniques. We provide a comprehensive
survey of the learning to hash framework and representative techniques of
various types, including unsupervised, semi-supervised, and supervised. In addition, we summarize recent hashing approaches that utilize deep learning models. Finally, we discuss future directions and trends of research in this area.
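As a concrete point of reference for the data-independent baseline the survey contrasts against, here is a minimal sketch of random-hyperplane LSH in Python; the dimensions, the toy data, and all names are illustrative, not taken from the survey.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_bits = 128, 16
planes = rng.standard_normal((n_bits, dim))  # one random hyperplane per code bit

def hash_code(x: np.ndarray) -> int:
    """One bit per hyperplane: the sign of the random projection."""
    bits = planes @ x > 0
    return sum(1 << i for i, b in enumerate(bits) if b)

# Index a toy dataset into hash buckets.
data = rng.standard_normal((1000, dim))
buckets = defaultdict(list)
for idx, vec in enumerate(data):
    buckets[hash_code(vec)].append(idx)

# ANN query: rank only the query's bucket instead of all 1000 points.
query = data[42] + 0.05 * rng.standard_normal(dim)
candidates = buckets[hash_code(query)]
nearest = min(candidates, key=lambda i: np.linalg.norm(data[i] - query), default=None)
print(nearest)  # often 42; one flipped bit sends near-duplicates to different
                # buckets, which is why real systems query multiple hash tables
```

Learning-to-hash methods replace the random `planes` above with projections fit to the data distribution or to class labels, which is the survey's central theme.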
EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models
This paper describes EMBER: a labeled benchmark dataset for training machine
learning models to statically detect malicious Windows portable executable
files. The dataset includes features extracted from 1.1M binary files: 900K
training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test
samples (100K malicious, 100K benign). To accompany the dataset, we also
release open source code for extracting features from additional binaries so
that additional sample features can be appended to the dataset. This dataset
fills a void in the information security machine learning community: a
benign/malicious dataset that is large, open and general enough to cover
several interesting use cases. We enumerate several use cases that we
considered when structuring the dataset. Additionally, we demonstrate one use
case wherein we compare a baseline gradient boosted decision tree model trained
using LightGBM with default settings to MalConv, a recently published
end-to-end (featureless) deep learning model for malware detection. Results
show that even without hyper-parameter optimization, the baseline EMBER model
outperforms MalConv. The authors hope that the dataset, code and baseline model
provided by EMBER will help invigorate machine learning research for malware
detection, in much the same way that benchmark datasets have advanced computer
vision research.
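The baseline experiment is simple to reproduce in outline. Below is a hedged sketch of a default-settings LightGBM classifier on EMBER-style vectorized features; the random arrays stand in for the real feature matrices (the EMBER release includes tooling for producing them), and the 2381-column shape assumes the v2 feature set.

```python
import numpy as np
import lightgbm as lgb

# Stand-ins for the real EMBER feature matrices; shapes are assumptions.
rng = np.random.default_rng(0)
n_features = 2381                            # assumed EMBER v2 feature dimensionality
X_train = rng.standard_normal((10_000, n_features)).astype(np.float32)
y_train = rng.integers(0, 2, size=10_000)    # 0 = benign, 1 = malicious
X_test = rng.standard_normal((2_000, n_features)).astype(np.float32)

# Default settings, mirroring the paper's no-tuning baseline.
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # P(malicious) per sample
print(scores[:5])
```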
A Deep Hashing Learning Network
Hashing-based methods seek compact and efficient binary codes that preserve
the neighborhood structure in the original data space. For most existing
hashing methods, an image is first encoded as a vector of hand-crafted visual
feature, followed by a hash projection and quantization step to get the compact
binary vector. Most hand-crafted features encode only low-level information about the input, so they may not preserve the semantic similarity of image pairs. Meanwhile, the hash-function learning process is independent of the feature representation, so the features may not be optimal for the hash projection. In this paper, we propose a supervised hashing method based on a well-designed deep convolutional neural network, which tries to learn hash codes and compact representations of data
simultaneously. The proposed model learns the binary codes by adding a compact sigmoid layer before the loss layer. Experiments on several image datasets show that the proposed model outperforms other state-of-the-art methods.
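As a rough illustration of the core idea, namely a compact sigmoid layer inserted before the loss layer so that codes and representations are learned jointly, here is a hypothetical PyTorch sketch; the architecture is illustrative, not the authors' exact network.

```python
import torch
import torch.nn as nn

class DeepHashNet(nn.Module):
    def __init__(self, n_bits: int = 48, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
        )
        # The compact sigmoid "hash layer" sits between features and loss.
        self.hash_layer = nn.Sequential(nn.Linear(64 * 16, n_bits), nn.Sigmoid())
        self.classifier = nn.Linear(n_bits, n_classes)  # the loss layer on top

    def forward(self, x):
        h = self.hash_layer(self.features(x))  # activations in (0, 1)
        return self.classifier(h), h

# Binary codes are obtained by thresholding the sigmoid activations.
net = DeepHashNet()
logits, h = net(torch.randn(2, 3, 32, 32))
codes = (h > 0.5).int()
print(codes.shape)  # torch.Size([2, 48])
```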
On Hash-Based Work Distribution Methods for Parallel Best-First Search
Parallel best-first search algorithms such as Hash Distributed A* (HDA*)
distribute work among the processes using a global hash function. We analyze
the search and communication overheads of state-of-the-art hash-based parallel
best-first search algorithms, and show that although Zobrist hashing, the
standard hash function used by HDA*, achieves good load balance for many
domains, it incurs significant communication overhead since almost all
generated nodes are transferred to a different processor than their parents. We
propose Abstract Zobrist hashing, a new work distribution method for parallel
search which, instead of computing a hash value based on the raw features of a state, uses a feature projection function to generate a set of abstract features with higher locality, reducing communication overhead. We show that Abstract Zobrist hashing outperforms previous methods on search domains using hand-coded, domain-specific feature
projection functions. We then propose GRAZHDA*, a graph-partitioning based
approach to automatically generating feature projection functions. GRAZHDA*
seeks to approximate the partitioning of the actual search space graph by
partitioning the domain transition graph, an abstraction of the state space
graph. We show that GRAZHDA* outperforms previous methods on domain-independent
planning.
Source code of the domain-specific solvers in a multicore environment: https://github.com/jinnaiyuu/Parallel-Best-First-Searches. Source code of classical planning in a distributed environment: https://github.com/jinnaiyuu/distributed-fast-downwar
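To make the contrast concrete, here is a minimal sketch of plain Zobrist hashing versus an abstract-feature variant for assigning states to processes; the state encoding and the projection function are illustrative, not the paper's.

```python
import random

random.seed(0)
N_VARS, N_VALS, N_PROCS = 8, 16, 4

# One fixed random bitstring per (variable, value) feature.
zobrist = [[random.getrandbits(64) for _ in range(N_VALS)] for _ in range(N_VARS)]

def zobrist_hash(state):
    """XOR the bitstrings of all features; cheap to update incrementally."""
    h = 0
    for var, val in enumerate(state):
        h ^= zobrist[var][val]
    return h

def abstract(state):
    """Feature projection: coarsen each value by dropping its low bits."""
    return tuple(val // 4 for val in state)

def owner(state, abstracted=False):
    s = abstract(state) if abstracted else state
    return zobrist_hash(s) % N_PROCS

s = (3, 7, 1, 0, 12, 5, 9, 2)
child = (3, 7, 1, 0, 12, 5, 9, 3)          # one feature changed by expansion
print(owner(s), owner(child))              # plain Zobrist: typically different
print(owner(s, True), owner(child, True))  # abstract: equal when the projection
                                           # maps both states to the same features
```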
LSH Ensemble: Internet-Scale Domain Search
We study the problem of domain search where a domain is a set of distinct
values from an unspecified universe. We use Jaccard set containment, defined as |Q ∩ X| / |Q|, as the relevance measure of a domain X to a query domain Q. Our choice of Jaccard set containment over Jaccard similarity makes our
work particularly suitable for searching Open Data and data on the web, as
Jaccard similarity is known to have poor performance over sets with large
differences in their domain sizes. We demonstrate that the domains found in
several real-life Open Data and web data repositories show a power-law
distribution over their domain sizes.
We present a new index structure, Locality Sensitive Hashing (LSH) Ensemble,
that solves the domain search problem using set containment at Internet scale.
Our index structure and search algorithm cope with the data volume and skew by
means of data sketches (MinHash) and domain partitioning. Our index structure
does not assume a prescribed set of values. We construct a cost model that
describes the accuracy of LSH Ensemble with any given partitioning. This allows
us to formulate the partitioning for LSH Ensemble as an optimization problem.
We prove that there exists an optimal partitioning for any distribution.
Furthermore, for datasets following a power-law distribution, as observed in
Open Data and Web data corpora, we show that the optimal partitioning can be
approximated using equi-depth, making it efficient to use in practice.
We evaluate our algorithm using real data (Canadian Open Data and WDC Web
Tables) containing up to 262 M domains. The experiments demonstrate that our
index consistently outperforms other leading alternatives in accuracy and
performance. The improvements are most dramatic for data with large skew in the
domain sizes. Even at 262 M domains, our index sustains query performance with
a response time of under 3 seconds.
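The containment search rests on MinHash sketches; below is a self-contained sketch of MinHash estimation and the standard conversion from estimated Jaccard similarity to set containment |Q ∩ X| / |Q| when the domain sizes are known. The salted-hash simulation of permutations is illustrative; the paper's LSH Ensemble adds domain partitioning on top of such sketches.

```python
import hashlib

NUM_PERM = 256

def minhash(values):
    """One minimum per 'permutation', simulated by salting a hash function."""
    sig = []
    for i in range(NUM_PERM):
        salt = i.to_bytes(4, "big")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(salt + v.encode(), digest_size=8).digest(), "big")
            for v in values
        ))
    return sig

def jaccard_est(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM

def containment_est(sig_q, sig_x, size_q, size_x):
    """With j = |Q∩X|/|Q∪X|, containment is j(|Q| + |X|) / (|Q|(1 + j))."""
    j = jaccard_est(sig_q, sig_x)
    return j * (size_q + size_x) / (size_q * (1 + j))

q = {"on", "qc", "bc"}                    # query domain
x = {"on", "qc", "bc", "ab", "mb", "ns"}  # candidate domain containing q
print(containment_est(minhash(q), minhash(x), len(q), len(x)))  # true value: 1.0
```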
A Survey of Parallel A*
A* is a best-first search algorithm for finding optimal-cost paths in graphs.
A* benefits significantly from parallelism because, in many applications, A* is limited by memory usage; distributed-memory implementations that use the aggregate memory of a cluster can solve problems that serial, single-machine implementations cannot. We survey
approaches to parallel A*, focusing on decentralized approaches to A* which
partition the state space among processors. We also survey approaches to
parallel, limited-memory variants of A* such as parallel IDA*.
Data-Parallel Hashing Techniques for GPU Architectures
Hash tables are one of the most fundamental data structures for effectively
storing and accessing sparse data, with widespread usage in domains ranging
from computer graphics to machine learning. This study surveys the
state-of-the-art research on data-parallel hashing techniques for emerging
massively parallel, many-core GPU architectures. Key factors affecting the performance of different hashing schemes are identified and used to suggest best practices and pinpoint areas for further research.
Graph Kernels based on High Order Graphlet Parsing and Hashing
Graph-based methods are known to be successful in many machine learning and
pattern classification tasks. These methods consider semi-structured data as
graphs where nodes correspond to primitives (parts, interest points, segments,
etc.) and edges characterize the relationships between these primitives.
However, such non-vectorial graph data cannot be straightforwardly plugged into off-the-shelf machine learning algorithms without a preliminary step of explicit or implicit graph vectorization and embedding. This embedding process
should be resilient to intra-class graph variations while being highly
discriminant. In this paper, we propose a novel high-order stochastic graphlet
embedding (SGE) that maps graphs into vector spaces. Our main contribution
includes a new stochastic search procedure that efficiently parses a given graph and extracts and samples graphlets of arbitrarily high order. We consider these graphlets, with increasing orders, to model local primitives as well as their increasingly complex interactions. To build our graph representation, we measure the distribution of these graphlets in a given graph, using particular hash functions that efficiently assign sampled graphlets to
isomorphic sets with a very low probability of collision. When combined with
maximum margin classifiers, these graphlet-based representations have positive
impact on the performance of pattern comparison and recognition as corroborated
through extensive experiments using standard benchmark databases.
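As a toy illustration of hashing graphlets into isomorphism buckets, the sketch below keys each sampled graphlet by a cheap graph invariant (node count, edge count, sorted degree sequence). The paper's hash functions are more discriminative, so treat this as a stand-in only.

```python
from collections import Counter
from itertools import combinations

def graphlet_key(nodes, edges):
    """Isomorphism-invariant key: graphlets with different keys are never
    isomorphic; graphlets with equal keys may occasionally collide."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    degree_seq = tuple(sorted(deg.get(n, 0) for n in nodes))
    return (len(nodes), len(edges), degree_seq)

# Histogram of 3-node graphlets for a toy graph (exhaustive here; the
# paper samples graphlets stochastically instead).
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]  # a triangle with a pendant edge
hist = Counter()
for sub in combinations(range(4), 3):
    sub_edges = [(u, v) for u, v in edges if u in sub and v in sub]
    hist[graphlet_key(sub, sub_edges)] += 1
print(hist)  # the bucket histogram is the embedding's feature vector
```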
SHIP: A Scalable High-performance IPv6 Lookup Algorithm that Exploits Prefix Characteristics
Due to the emergence of new network applications, current IP lookup engines must support high bandwidth, low lookup latency, and the ongoing growth of IPv6 networks. However, existing solutions are not designed to jointly address these three requirements. This paper introduces SHIP, an IPv6 lookup algorithm that
exploits prefix characteristics to build a two-level data structure designed to
meet future application requirements. Using both prefix length distribution and
prefix density, SHIP first clusters prefixes into groups sharing similar
characteristics, then it builds a hybrid trie-tree for each prefix group. The resulting compact and scalable data structure can be stored in low-latency on-chip memory, and it allows the traversal process to be parallelized and pipelined at each level to support high packet bandwidth. Evaluated on real and synthetic prefix tables holding up to 580K IPv6 prefixes, SHIP scales logarithmically in the number of memory accesses and linearly in memory consumption. Using the largest synthetic prefix table,
simulations show that compared to other well-known approaches, SHIP uses at
least 44% less memory per prefix, while reducing the memory latency by 61%.
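As background for the trie half of the hybrid trie-tree, here is a minimal binary trie for longest-prefix match over bit strings; SHIP's actual structure adds prefix grouping and a tree stage, so this is illustrative only.

```python
class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]  # 0-branch and 1-branch
        self.next_hop = None          # set if a prefix ends at this node

def insert(root, prefix_bits, next_hop):
    node = root
    for b in prefix_bits:
        if node.children[b] is None:
            node.children[b] = TrieNode()
        node = node.children[b]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Walk the address bits, remembering the last prefix seen."""
    node, best = root, None
    for b in addr_bits:
        node = node.children[b]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
insert(root, [0, 0, 1], "A")             # prefix 001/3
insert(root, [0, 0, 1, 1], "B")          # prefix 0011/4, more specific
print(lookup(root, [0, 0, 1, 1, 0, 1]))  # "B": the longest match wins
```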
ASURA: Scalable and Uniform Data Distribution Algorithm for Storage Clusters
Large-scale storage cluster systems need to manage a vast amount of data
locations. A naive approach to managing data locations maintains tables of pairs of data IDs and the nodes storing the data. However, this is not practical when the number of pairs is too large. To solve this problem, management using data
distribution algorithms, rather than management using tables, has been proposed
in recent research. It can distribute data by determining the node for storing
the data based on the datum ID. Such data distribution algorithms require the
ability to handle the addition or removal of nodes, short calculation time and
uniform data distribution in the capacity of each node. This paper proposes a
data distribution algorithm called ASURA (Advanced Scalable and Uniform storage
by Random number Algorithm) that satisfies these requirements. It achieves the following four characteristics: 1) minimum data movement to maintain data
distribution according to node capacity when nodes are added or removed, even
if data are replicated, 2) calculation times of roughly a microsecond or less, 3) maximum data-distribution variability between nodes of well under 1%, and 4)
data distribution according to the capacity of each node. The evaluation
results show that ASURA is qualitatively and quantitatively competitive against
major data distribution algorithms such as Consistent Hashing, Weighted
Rendezvous Hashing, and Random Slicing. The comparison results show the benefits of each algorithm; in particular, they show that ASURA has an advantage in large scale-out storage clusters.
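For context, here is a minimal sketch of Weighted Rendezvous Hashing, one of the baselines ASURA is compared against: every node scores every key, the highest score wins, and weights steer each node's share of the keys. Node names and weights are illustrative.

```python
import hashlib
import math

def _unit_hash(key: str, node: str) -> float:
    """Hash (key, node) to a float strictly inside (0, 1)."""
    digest = hashlib.sha256(f"{key}:{node}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)

def pick_node(key: str, nodes: dict) -> str:
    """nodes maps node name -> capacity weight. Adding or removing a node
    only relocates the keys that node wins or loses."""
    return max(nodes, key=lambda n: -nodes[n] / math.log(_unit_hash(key, n)))

nodes = {"node-a": 1.0, "node-b": 1.0, "node-c": 2.0}  # node-c: double capacity
counts = {n: 0 for n in nodes}
for i in range(10_000):
    counts[pick_node(f"obj-{i}", nodes)] += 1
print(counts)  # node-c should receive roughly half of the objects
```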