
    Learning to Hash for Indexing Big Data - A Survey

    The explosive growth in big data has recently attracted much attention to the design of efficient indexing and search methods. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution of exhaustive comparison is infeasible due to prohibitive computational complexity and memory requirements. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Earlier randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), use data-independent hash functions based on random projections or permutations. Although they have elegant theoretical guarantees on search quality in certain metric spaces, the performance of randomized hashing has proven insufficient in many real-world applications. As a remedy, new approaches that incorporate data-driven learning into the design of hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes preserve, in the hash code space, the proximity of neighboring data in the original feature space. The goal of this paper is to provide readers with a systematic understanding of the insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we summarize recent hashing approaches that utilize deep learning models. Finally, we discuss future directions and trends of research in this area.
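
    As a rough illustration of the data-independent baseline that the survey contrasts with learned hashing, the sketch below implements SimHash-style random-projection LSH: random hyperplanes produce binary codes whose Hamming distances approximate angular proximity. The bit length, the Gaussian hyperplanes, and the array shapes are illustrative choices, not anything prescribed by the survey.

```python
# Minimal sketch of data-independent random-projection LSH (SimHash-style).
import numpy as np

def fit_random_projections(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes; no training data is used (data-independent)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def hash_codes(X, planes):
    """Binary code: one bit per hyperplane, set when the point lies on its positive side."""
    return (X @ planes.T > 0).astype(np.uint8)

def hamming_search(query_code, db_codes, k=5):
    """Approximate NN search: rank database items by Hamming distance to the query code."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]

# Usage: points with small angular distance tend to share many code bits.
X = np.random.default_rng(1).standard_normal((1000, 128))
planes = fit_random_projections(dim=128, n_bits=64)
codes = hash_codes(X, planes)
neighbors = hamming_search(hash_codes(X[:1], planes)[0], codes)
```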

    EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

    This paper describes EMBER: a labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). To accompany the dataset, we also release open source code for extracting features from additional binaries so that additional sample features can be appended to the dataset. This dataset fills a void in the information security machine learning community: a benign/malicious dataset that is large, open, and general enough to cover several interesting use cases. We enumerate several use cases that we considered when structuring the dataset. Additionally, we demonstrate one use case wherein we compare a baseline gradient boosted decision tree model trained using LightGBM with default settings to MalConv, a recently published end-to-end (featureless) deep learning model for malware detection. Results show that even without hyper-parameter optimization, the baseline EMBER model outperforms MalConv. The authors hope that the dataset, code, and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.
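
    A hedged sketch of the kind of baseline described above: a LightGBM gradient boosted decision tree trained with default settings on precomputed feature vectors. The random arrays stand in for vectorized EMBER features (the real dataset ships its own feature-extraction code, which is not reproduced here), and the dimensionality and sample counts are arbitrary.

```python
# Placeholder GBDT baseline in the spirit of the abstract; data arrays are synthetic stand-ins.
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# Synthetic feature matrices standing in for vectorized PE features
# (in EMBER, unlabeled training rows would be dropped before supervised training).
X_train = np.random.rand(1000, 256)          # dimensionality is illustrative only
y_train = np.random.randint(0, 2, 1000)
X_test = np.random.rand(200, 256)
y_test = np.random.randint(0, 2, 200)

# "Default settings" as in the abstract: no hyper-parameter tuning.
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, scores))
```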

    A Deep Hashing Learning Network

    Hashing-based methods seek compact and efficient binary codes that preserve the neighborhood structure of the original data space. In most existing hashing methods, an image is first encoded as a vector of hand-crafted visual features, followed by a hash projection and quantization step to obtain the compact binary vector. Most hand-crafted features only encode low-level information about the input, so they may not preserve the semantic similarities between image pairs. Moreover, the hash function learning process is independent of the feature representation, so the features may not be optimal for the hash projection. In this paper, we propose a supervised hashing method based on a well-designed deep convolutional neural network, which tries to learn the hash codes and compact representations of the data simultaneously. The proposed model learns the binary codes by adding a compact sigmoid layer before the loss layer. Experiments on several image datasets show that the proposed model outperforms other state-of-the-art methods. Comment: 7 pages, 5 figures
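
    The sketch below is not the authors' network, only a minimal PyTorch illustration of the stated idea: a compact sigmoid layer inserted before the classification loss produces relaxed codes in (0, 1) that are thresholded into binary hash codes after training. The layer sizes, bit length, and class count are placeholders.

```python
# Illustrative deep hashing model: CNN features -> compact sigmoid layer -> loss layer.
import torch
import torch.nn as nn

class DeepHashNet(nn.Module):
    def __init__(self, n_bits=48, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        # Compact sigmoid layer: relaxed hash codes in (0, 1).
        self.hash_layer = nn.Sequential(nn.Flatten(), nn.Linear(64, n_bits), nn.Sigmoid())
        # Classification ("loss") layer sits on top of the hash layer during training.
        self.classifier = nn.Linear(n_bits, n_classes)

    def forward(self, x):
        h = self.hash_layer(self.features(x))
        return self.classifier(h), h

model = DeepHashNet()
logits, relaxed = model(torch.randn(4, 3, 32, 32))
binary_codes = (relaxed > 0.5).to(torch.uint8)   # quantize after training for retrieval
```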

    On Hash-Based Work Distribution Methods for Parallel Best-First Search

    Parallel best-first search algorithms such as Hash Distributed A* (HDA*) distribute work among the processes using a global hash function. We analyze the search and communication overheads of state-of-the-art hash-based parallel best-first search algorithms, and show that although Zobrist hashing, the standard hash function used by HDA*, achieves good load balance for many domains, it incurs significant communication overhead because almost all generated nodes are transferred to a different processor than their parents. We propose Abstract Zobrist hashing, a new work distribution method for parallel search which, instead of computing a hash value from the raw features of a state, applies a feature projection function to generate a set of abstract features, yielding higher locality and thus reduced communication overhead. We show that Abstract Zobrist hashing outperforms previous methods on search domains using hand-coded, domain-specific feature projection functions. We then propose GRAZHDA*, a graph-partitioning based approach to automatically generating feature projection functions. GRAZHDA* seeks to approximate the partitioning of the actual search space graph by partitioning the domain transition graph, an abstraction of the state space graph. We show that GRAZHDA* outperforms previous methods on domain-independent planning. Comment: source code of domain-specific solvers in a multicore environment: https://github.com/jinnaiyuu/Parallel-Best-First-Searches; source code of classical planning in a distributed environment: https://github.com/jinnaiyuu/distributed-fast-downwar
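
    To make the work-distribution idea concrete, the sketch below contrasts plain Zobrist hashing with Abstract Zobrist hashing: both XOR one random table entry per state feature, but the abstract variant hashes projected features, so states that differ only in details dropped by the projection are owned by the same processor. The toy state encoding and the value-dropping projection are stand-ins for the hand-coded or GRAZHDA*-generated projections in the paper.

```python
# Sketch of Zobrist vs. Abstract Zobrist hashing for assigning states to processors.
import random

random.seed(0)
N_PROCS = 8
_table = {}

def _rand64(feature):
    """Lazily built random table: one 64-bit value per (possibly abstract) feature."""
    if feature not in _table:
        _table[feature] = random.getrandbits(64)
    return _table[feature]

def zobrist_hash(state):
    """state: iterable of (variable, value) features; XOR the table entries."""
    h = 0
    for feature in state:
        h ^= _rand64(feature)
    return h

def abstract_zobrist_hash(state, project):
    """Hash the projected (abstract) features instead of the raw ones."""
    h = 0
    for feature in state:
        h ^= _rand64(project(feature))
    return h

state_a = [("x", 1), ("y", 2), ("z", 3)]
state_b = [("x", 1), ("y", 2), ("z", 4)]   # differs only in z's value
project = lambda f: f[0]                    # illustrative projection: drop the value
print(zobrist_hash(state_a) % N_PROCS, zobrist_hash(state_b) % N_PROCS)  # usually different owners
print(abstract_zobrist_hash(state_a, project) % N_PROCS,
      abstract_zobrist_hash(state_b, project) % N_PROCS)                 # same owner
```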

    LSH Ensemble: Internet-Scale Domain Search

    We study the problem of domain search, where a domain is a set of distinct values from an unspecified universe. We use Jaccard set containment, defined as |Q ∩ X| / |Q|, as the relevance measure of a domain X to a query domain Q. Our choice of Jaccard set containment over Jaccard similarity makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarity is known to perform poorly over sets with large differences in their domain sizes. We demonstrate that the domains found in several real-life Open Data and web data repositories show a power-law distribution over their domain sizes. We present a new index structure, Locality Sensitive Hashing (LSH) Ensemble, that solves the domain search problem using set containment at Internet scale. Our index structure and search algorithm cope with the data volume and skew by means of data sketches (MinHash) and domain partitioning. Our index structure does not assume a prescribed set of values. We construct a cost model that describes the accuracy of LSH Ensemble with any given partitioning. This allows us to formulate the partitioning for LSH Ensemble as an optimization problem. We prove that there exists an optimal partitioning for any distribution. Furthermore, for datasets following a power-law distribution, as observed in Open Data and Web data corpora, we show that the optimal partitioning can be approximated using equi-depth, making it efficient to use in practice. We evaluate our algorithm using real data (Canadian Open Data and WDC Web Tables) containing over 262 M domains. The experiments demonstrate that our index consistently outperforms other leading alternatives in accuracy and performance. The improvements are most dramatic for data with large skew in the domain sizes. Even at 262 M domains, our index sustains query performance with a response time under 3 seconds. Comment: To appear in VLDB 201
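
    A minimal sketch of the sketching side of such an index: MinHash signatures estimate Jaccard similarity, which, together with the known domain sizes, can be converted into the Jaccard set containment |Q ∩ X| / |Q| used as the relevance measure. The hash construction and signature length are illustrative, and the size-based partitioning that makes this scale is not shown.

```python
# MinHash signatures and a Jaccard-to-containment conversion using known set sizes.
import hashlib

def minhash_signature(values, n_perm=128):
    """One 64-bit hash per value, salted per permutation slot; keep the minimum per slot."""
    sig = []
    for i in range(n_perm):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{i}:{v}".encode(), digest_size=8).digest(), "big")
            for v in values))
    return sig

def jaccard_estimate(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def containment_estimate(sig_q, size_q, sig_x, size_x):
    """|Q ∩ X| = J * (|Q| + |X|) / (1 + J), so containment = |Q ∩ X| / |Q|."""
    j = jaccard_estimate(sig_q, sig_x)
    return j * (size_q + size_x) / ((1 + j) * size_q)

Q = {f"v{i}" for i in range(100)}
X = {f"v{i}" for i in range(50, 1000)}      # contains half of Q
est = containment_estimate(minhash_signature(Q), len(Q), minhash_signature(X), len(X))
print(f"estimated containment ≈ {est:.2f}, true = {len(Q & X) / len(Q):.2f}")
```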

    A Survey of Parallel A*

    A* is a best-first search algorithm for finding optimal-cost paths in graphs. A* benefits significantly from parallelism because, in many applications, A* is limited by memory usage, so distributed-memory implementations that use all of the aggregate memory on a cluster can solve problems that cannot be solved by serial, single-machine implementations. We survey approaches to parallel A*, focusing on decentralized approaches that partition the state space among processors. We also survey approaches to parallel, limited-memory variants of A*, such as parallel IDA*. Comment: arXiv admin note: text overlap with arXiv:1201.320

    Data-Parallel Hashing Techniques for GPU Architectures

    Hash tables are one of the most fundamental data structures for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively parallel, many-core GPU architectures. Key factors affecting the performance of different hashing schemes are identified and used to suggest best practices and to pinpoint areas for further research.

    Graph Kernels based on High Order Graphlet Parsing and Hashing

    Graph-based methods are known to be successful in many machine learning and pattern classification tasks. These methods consider semi-structured data as graphs where nodes correspond to primitives (parts, interest points, segments, etc.) and edges characterize the relationships between these primitives. However, such non-vectorial graph data cannot be straightforwardly plugged into off-the-shelf machine learning algorithms without a preliminary step of -- explicit or implicit -- graph vectorization and embedding. This embedding process should be resilient to intra-class graph variations while remaining highly discriminant. In this paper, we propose a novel high-order stochastic graphlet embedding (SGE) that maps graphs into vector spaces. Our main contribution is a new stochastic search procedure that efficiently parses a given graph and extracts and samples graphlets of arbitrarily high order. We consider these graphlets, with increasing orders, to model local primitives as well as their increasingly complex interactions. To build our graph representation, we measure the distribution of these graphlets in a given graph, using particular hash functions that efficiently assign sampled graphlets to isomorphic sets with a very low probability of collision. When combined with maximum margin classifiers, these graphlet-based representations have a positive impact on the performance of pattern comparison and recognition, as corroborated through extensive experiments using standard benchmark databases. Comment: arXiv admin note: substantial text overlap with arXiv:1702.0015
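
    As a rough illustration of the counting step, the sketch below samples connected graphlets by short random walks, maps each to an isomorphism-invariant key (the sorted induced degree sequence, a simple and collision-prone stand-in for the paper's hash functions), and accumulates a histogram used as the graph's vector embedding.

```python
# Toy graphlet sampling and hashing into a distribution over graphlet keys.
import random
from collections import Counter

def sample_graphlet(adj, size):
    """Grow a connected node set of up to `size` nodes by a random walk over `adj`."""
    nodes = {random.choice(list(adj))}
    while len(nodes) < size:
        frontier = {v for u in nodes for v in adj[u]} - nodes
        if not frontier:
            break
        nodes.add(random.choice(sorted(frontier)))
    return frozenset(nodes)

def graphlet_key(adj, nodes):
    """Isomorphism-invariant (but collision-prone) key: sorted degrees in the induced subgraph."""
    return tuple(sorted(sum(v in nodes for v in adj[u]) for u in nodes))

def graphlet_histogram(adj, size=4, n_samples=500):
    counts = Counter(graphlet_key(adj, sample_graphlet(adj, size)) for _ in range(n_samples))
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

# Toy graph: a 6-cycle with one chord (2-5).
adj = {0: [1, 5], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3, 5], 5: [4, 0, 2]}
print(graphlet_histogram(adj))
```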

    SHIP: A Scalable High-performance IPv6 Lookup Algorithm that Exploits Prefix Characteristics

    Due to the emergence of new network applications, current IP lookup engines must support high bandwidth, low lookup latency, and the ongoing growth of IPv6 networks. However, existing solutions are not designed to jointly address those three requirements. This paper introduces SHIP, an IPv6 lookup algorithm that exploits prefix characteristics to build a two-level data structure designed to meet future application requirements. Using both the prefix length distribution and the prefix density, SHIP first clusters prefixes into groups sharing similar characteristics, then builds a hybrid trie-tree for each prefix group. The resulting compact and scalable data structure can be stored in on-chip low-latency memories and allows the traversal process to be parallelized and pipelined at each level in order to support high packet bandwidth. Evaluated on real and synthetic prefix tables holding up to 580 k IPv6 prefixes, SHIP scales logarithmically in the number of memory accesses and linearly in memory consumption. Using the largest synthetic prefix table, simulations show that compared to other well-known approaches, SHIP uses at least 44% less memory per prefix while reducing the memory latency by 61%. Comment: Submitted to IEEE/ACM Transactions on Networking
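
    The sketch below is a drastic simplification of the two-level idea rather than SHIP itself: prefixes are first grouped by their leading bits, and each group is stored in a plain binary trie traversed bit by bit for longest-prefix matching. The 16-bit grouping width is arbitrary, and prefixes shorter than the group width are not handled.

```python
# Simplified two-level IPv6 longest-prefix match: group by leading bits, then a binary trie per group.
import ipaddress

GROUP_BITS = 16

def _bits(addr_int, length, total=128):
    """Most-significant `length` bits of a 128-bit address, MSB first."""
    return [(addr_int >> (total - 1 - i)) & 1 for i in range(length)]

class BinaryTrie:
    def __init__(self):
        self.root = {}
    def insert(self, addr_int, plen, nexthop):
        node = self.root
        for b in _bits(addr_int, plen):
            node = node.setdefault(b, {})
        node["nh"] = nexthop
    def lookup(self, addr_int):
        node, best = self.root, None
        for b in _bits(addr_int, 128):
            if "nh" in node:
                best = node["nh"]          # remember longest prefix seen so far
            if b not in node:
                break
            node = node[b]
        return node.get("nh", best)

groups = {}   # first level: leading GROUP_BITS bits -> trie

def insert_prefix(addr_int, plen, nexthop):
    groups.setdefault(addr_int >> (128 - GROUP_BITS), BinaryTrie()).insert(addr_int, plen, nexthop)

def lookup(addr_int):
    trie = groups.get(addr_int >> (128 - GROUP_BITS))
    return trie.lookup(addr_int) if trie else None

# Toy example with the documentation prefix 2001:db8::/32.
p = ipaddress.ip_network("2001:db8::/32")
insert_prefix(int(p.network_address), p.prefixlen, "port1")
print(lookup(int(ipaddress.ip_address("2001:db8::1"))))   # -> port1
```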

    ASURA: Scalable and Uniform Data Distribution Algorithm for Storage Clusters

    Large-scale storage cluster systems need to manage a vast number of data locations. A naive approach to managing data locations maintains pairs of data IDs and the nodes storing the data in tables; however, this is not practical when the number of pairs is too large. To solve this problem, recent research has proposed management based on data distribution algorithms rather than tables: the node that stores a datum is determined from the datum's ID. Such data distribution algorithms must handle the addition or removal of nodes, have short calculation times, and distribute data uniformly according to the capacity of each node. This paper proposes a data distribution algorithm called ASURA (Advanced Scalable and Uniform storage by Random number Algorithm) that satisfies these requirements. It achieves the following four characteristics: 1) minimum data movement to maintain data distribution according to node capacity when nodes are added or removed, even if data are replicated; 2) roughly sub-microsecond calculation time; 3) maximum variability between nodes in data distribution far below 1%; and 4) data distribution according to the capacity of each node. The evaluation results show that ASURA is qualitatively and quantitatively competitive against major data distribution algorithms such as Consistent Hashing, Weighted Rendezvous Hashing, and Random Slicing. The comparison results show the benefits of each algorithm; they show that ASURA has an advantage in large scale-out storage clusters. Comment: 14 pages
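
    Not ASURA itself, but a compact sketch of one of the baselines it is compared against, weighted rendezvous (highest-random-weight) hashing: each datum is placed on the node with the highest score, weights steer load toward higher-capacity nodes, and adding or removing a node only relocates the data whose top-scoring node changed.

```python
# Weighted rendezvous (HRW) hashing: score each node per datum, pick the maximum.
import hashlib
import math
from collections import Counter

def _unit_hash(data_id, node):
    """Deterministic hash of (data_id, node) mapped into the open interval (0, 1)."""
    digest = hashlib.blake2b(f"{data_id}:{node}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)

def place(data_id, nodes):
    """nodes: dict of node name -> capacity weight; higher score wins."""
    return max(nodes, key=lambda n: -nodes[n] / math.log(_unit_hash(data_id, n)))

nodes = {"node-a": 1.0, "node-b": 1.0, "node-c": 2.0}   # node-c has twice the capacity
print(Counter(place(i, nodes) for i in range(10000)))    # roughly a 1 : 1 : 2 split
```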