1,143 research outputs found
Kernelized Hashcode Representations for Relation Extraction
Kernel methods have produced state-of-the-art results for a number of NLP
tasks such as relation extraction, but suffer from poor scalability due to the
high cost of computing kernel similarities between natural language structures.
A recently proposed technique, kernelized locality-sensitive hashing (KLSH),
can significantly reduce the computational cost, but is only applicable to
classifiers operating on kNN graphs. Here we propose to use random subspaces of
KLSH codes for efficiently constructing an explicit representation of NLP
structures suitable for general classification methods. Further, we propose an
approach for optimizing the KLSH model for classification problems by
maximizing an approximation of mutual information between the KLSH codes
(feature vectors) and the class labels. We evaluate the proposed approach on
biomedical relation extraction datasets, and observe significant and robust
improvements in accuracy w.r.t. state-of-the-art classifiers, along with
drastic (orders-of-magnitude) speedup compared to conventional kernel methods. Comment: To appear in the proceedings of conference, AAAI-1
Efficient attributed network embedding via recursive randomized hashing
© 2018 International Joint Conferences on Artificial Intelligence. All rights reserved. Attributed network embedding aims to learn a low-dimensional representation for each node of a network, considering both the attributes and the structure information of the node. However, learning-based methods usually incur a substantial time cost, which makes them impractical without powerful hardware. In this paper, we propose a simple yet effective algorithm, named NetHash, that solves this problem with only moderate computing capacity. NetHash employs the randomized hashing technique to encode shallow trees, each of which is rooted at a node of the network. The main idea is to efficiently encode both attributes and structure information of each node by recursively sketching the corresponding rooted tree from bottom (i.e., the predefined highest-order neighboring nodes) to top (i.e., the root node), and in particular to preserve as much information close to the root node as possible. Our extensive experimental results show that the proposed algorithm, which does not need learning, runs significantly faster than the state-of-the-art learning-based network embedding methods while achieving competitive or even better accuracy.
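The bottom-up recursive sketching described in this abstract can be illustrated with a minimal sketch. This is not the NetHash implementation: the names `minhash_sketch`, `sketch_tree` and `similarity`, the use of BLAKE2b as the hash family, the fixed sketch size at every level, and the choice to skip the parent node when expanding the tree are all assumptions made for illustration.

```python
import hashlib


def _hash(salt: str, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given salt."""
    data = f"{salt}|{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, size, level):
    """For each of `size` salted hash functions, keep the item with the minimum hash."""
    if not items:
        return []
    return [min(items, key=lambda x: _hash(f"{level}:{i}", x)) for i in range(size)]


def sketch_tree(graph, attrs, node, depth, size, parent=None):
    """Sketch the tree rooted at `node` from the leaves up to the root.

    The node's own attributes are merged with the sketches of its children,
    and the merged set is sketched again. Attributes far from the root pass
    through more sketching rounds, so they are sampled more aggressively.
    """
    content = set(attrs[node])
    if depth > 0:
        for neighbor in graph[node]:
            if neighbor != parent:  # assumption: do not walk straight back
                content.update(
                    sketch_tree(graph, attrs, neighbor, depth - 1, size, node)
                )
    return minhash_sketch(content, size, level=depth)


def similarity(sketch_a, sketch_b):
    """Fraction of positions at which two sketches agree."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)
```

Under these assumptions, two nodes with identical attributes and identical neighborhoods receive identical sketches, and no training step is involved.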
Hierarchical Locality Sensitive Hashing for Structured Data: A Survey
Data similarity (or distance) computation is a fundamental research topic
which fosters a variety of similarity-based machine learning and data mining
applications. In big data analytics, it is impractical to compute the exact
similarity of data instances due to high computational cost. To this end, the
Locality Sensitive Hashing (LSH) technique has been proposed to provide
accurate estimators for various similarity measures between sets or vectors in
an efficient manner without the learning process. Structured data (e.g.,
sequences, trees and graphs), which are composed of elements and relations
between the elements, are commonly seen in the real world, but the traditional
LSH algorithms cannot preserve the structure information represented as
relations between elements. To address this issue, researchers have developed
a family of hierarchical LSH algorithms. In this paper,
we explore the present progress of the research into hierarchical LSH from the
following perspectives: 1) Data structures, where we review various
hierarchical LSH algorithms for three typical data structures and uncover their
inherent connections; 2) Applications, where we review the hierarchical LSH
algorithms in multiple application scenarios; 3) Challenges, where we discuss
some potential challenges as future directions.
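As a concrete instance of an LSH estimator for a set similarity measure, the following minimal sketch uses MinHash to estimate Jaccard similarity without any learning step. It is a flat (non-hierarchical) example only; the function names and the use of salted BLAKE2b hashes as the hash family are assumptions made for illustration.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """One minimum hash value per seeded hash function (items must be non-empty)."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison on small sets."""
    return len(a & b) / len(a | b)
```

For two sets, each signature position agrees with probability equal to their Jaccard similarity, so the estimate tightens as `num_hashes` grows.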
Load thresholds for cuckoo hashing with overlapping blocks
Dietzfelbinger and Weidling [DW07] proposed a natural variation of cuckoo
hashing where each object is assigned two intervals of a fixed size
in a linear (or cyclic) hash table and both start points are chosen
independently and uniformly at random. Each object must be placed into a table
cell within its intervals, but each cell can only hold one object. Experiments
suggested that this scheme outperforms the variant with blocks in which
intervals are aligned at multiples of the interval size. In particular, the load threshold
is higher, i.e. the load that can be achieved with high probability. For
instance, Lehman and Panigrahy [LP09] empirically observed the threshold for
to be around as compared to roughly using blocks.
They managed to pin down the asymptotics of the thresholds for large ,
but the precise values resisted rigorous analysis.
We establish a method to determine these load thresholds for all , and, in fact, for general . For instance, for we
get . The key tool we employ is an insightful and general
theorem due to Leconte, Lelarge, and Massoulié [LLM13], which adapts methods
from statistical physics to the world of hypergraph orientability. In effect,
the orientability thresholds for our graph families are determined by belief
propagation equations for certain graph limits. As a side note we provide
experimental evidence suggesting that placements can be constructed in linear
time with loads close to the threshold using an adapted version of an algorithm
by Khosla [Kho13].
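The placement rule in this abstract (each object may occupy any cell inside one of its two randomly placed intervals of a cyclic table) can be sketched as follows. This is a minimal illustration with a plain random-walk eviction strategy, not the adapted algorithm of Khosla mentioned above; the class name, the parameter names `size` and `window`, and the hash construction are assumptions.

```python
import hashlib
import random


class OverlappingCuckooTable:
    """Cuckoo table where each key owns two unaligned windows of cells."""

    def __init__(self, size, window=2, seed=0):
        self.size = size            # number of cells in the cyclic table
        self.window = window        # cells per interval (hypothetical name)
        self.cells = [None] * size  # each cell holds at most one key
        self.rng = random.Random(seed)

    def _starts(self, key):
        """Two independent start positions derived from one digest."""
        digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=16).digest()
        first = int.from_bytes(digest[:8], "big") % self.size
        second = int.from_bytes(digest[8:], "big") % self.size
        return first, second

    def candidates(self, key):
        """All cells covered by the key's two windows, wrapping cyclically."""
        return [(s + off) % self.size
                for s in self._starts(key)
                for off in range(self.window)]

    def insert(self, key, max_steps=1000):
        """Random-walk insertion. On failure, the key held last is left unplaced."""
        for _ in range(max_steps):
            cand = self.candidates(key)
            for c in cand:
                if self.cells[c] is None:
                    self.cells[c] = key
                    return True
            victim = self.rng.choice(cand)  # evict a random occupant and retry with it
            key, self.cells[victim] = self.cells[victim], key
        return False

    def contains(self, key):
        return any(self.cells[c] == key for c in self.candidates(key))
```

A blocked variant would differ only in `_starts`, which would round each start position down to a multiple of `window`.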
Dense peelable random uniform hypergraphs
We describe a new family of k-uniform hypergraphs with independent random edges. The hypergraphs have a high probability of being peelable, i.e. to admit no sub-hypergraph of minimum degree 2, even when the edge density (number of edges over vertices) is close to 1.
In our construction, the vertex set is partitioned into linearly arranged segments and each edge is incident to random vertices of k consecutive segments. Quite surprisingly, the linear geometry allows our graphs to be peeled "from the outside in". The density thresholds f_k for peelability of our hypergraphs (f_3 ~ 0.918, f_4 ~ 0.977, f_5 ~ 0.992, ...) are well beyond the corresponding thresholds (c_3 ~ 0.818, c_4 ~ 0.772, c_5 ~ 0.702, ...) of standard k-uniform random hypergraphs.
To get a grip on f_k, we analyse an idealised peeling process on the random weak limit of our hypergraph family. The process can be described in terms of an operator on [0,1]^Z and f_k can be linked to thresholds relating to the operator. These thresholds are then tractable with numerical methods.
Random hypergraphs underlie the construction of various data structures based on hashing, for instance invertible Bloom filters, perfect hash functions, retrieval data structures, error correcting codes and cuckoo hash tables, where inputs are mapped to edges using hash functions. Frequently, the data structures rely on peelability of the hypergraph or peelability allows for simple linear time algorithms. Memory efficiency is closely tied to edge density while worst and average case query times are tied to maximum and average edge size.
To demonstrate the usefulness of our construction, we used our 3-uniform hypergraphs as a drop-in replacement for the standard 3-uniform hypergraphs in a retrieval data structure by Botelho et al. [Fabiano Cupertino Botelho et al., 2013]. This reduces memory usage from 1.23m bits to 1.12m bits (m being the input size) with almost no change in running time. Using k > 3 attains, at small sacrifices in running time, further improvements to memory usage
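The construction and the peeling process described in this abstract can be illustrated with a minimal sketch. It assumes that the first of the k consecutive segments is chosen uniformly at random and that one vertex is drawn uniformly from each segment; the abstract does not specify these details, and the function names and parameters are hypothetical.

```python
import random
from collections import defaultdict


def segmented_hypergraph(num_edges, num_segments, segment_size, k=3, seed=0):
    """Random k-uniform edges, each touching one vertex in k consecutive segments."""
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        first = rng.randrange(num_segments - k + 1)  # assumption: uniform start segment
        edges.append(tuple((first + j) * segment_size + rng.randrange(segment_size)
                           for j in range(k)))
    return edges


def peel(edges):
    """Repeatedly remove a degree-1 vertex together with its only edge.

    Returns the peeling order as (vertex, edge index) pairs and a flag that is
    True when every edge was removed, i.e. no sub-hypergraph of minimum
    degree 2 remains.
    """
    incident = defaultdict(set)
    for e_id, edge in enumerate(edges):
        for v in edge:
            incident[v].add(e_id)
    stack = [v for v, es in incident.items() if len(es) == 1]
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:
            continue  # degree changed since v was pushed
        (e_id,) = incident[v]
        order.append((v, e_id))
        for u in edges[e_id]:
            incident[u].discard(e_id)
            if len(incident[u]) == 1:
                stack.append(u)
    return order, len(order) == len(edges)
```

Here the edge density is `num_edges / (num_segments * segment_size)`. Data structures that rely on peelability typically process the returned order in reverse to assign values, which is what makes the linear-time algorithms mentioned in the abstract possible.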