Finding Associations and Computing Similarity via Biased Pair Sampling
This version is ***superseded*** by a full version that can be found at
http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger
theoretical results and fixes a mistake in the reporting of experiments.
Abstract: Sampling-based methods have previously been proposed for the
problem of finding interesting associations in data, even for low-support
items. While these methods do not guarantee precise results, they can be vastly
more efficient than approaches that rely on exact counting. However, for many
similarity measures no such methods have been known. In this paper we show how
a wide variety of measures can be supported by a simple biased sampling method.
The method also extends to find high-confidence association rules. We
demonstrate theoretically that our method is superior to exact methods when the
threshold for "interesting similarity/confidence" is above the average pairwise
similarity/confidence, and the average support is not too low. Our method is
particularly good when transactions contain many items. We confirm in
experiments on standard association mining benchmarks that this gives a
significant speedup on real data sets (sometimes much larger than the
theoretical guarantees). Reductions in computation time of over an order of
magnitude, and significant savings in space, are observed.
Comment: This is an extended version of a paper that appeared at the IEEE
International Conference on Data Mining, 2009. The conference version is (c)
2009 IEE
A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
Estimating set similarity and detecting highly similar sets are fundamental
problems in areas such as databases, machine learning, and information
retrieval. MinHash is a well-known technique for approximating Jaccard
similarity of sets and has been successfully used for many applications such as
similarity search and large scale learning. Its two compressed versions, b-bit
MinHash and Odd Sketch, can significantly reduce the memory usage of the
original MinHash method, especially for estimating high similarities (i.e.,
similarities around 1). Although MinHash can be applied to static sets as well
as to streaming sets, whose elements arrive in a streaming fashion and whose
cardinality is unknown or even infinite, b-bit MinHash and Odd
Sketch cannot handle streaming data. To solve this problem, we design a
memory efficient sketch method, MaxLogHash, to accurately estimate Jaccard
similarities in streaming sets. Compared to MinHash, our method uses smaller
sized registers (each register consists of fewer than 7 bits) to build a compact
sketch for each set. We also provide a simple yet accurate estimator for
inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive
formulas for bounding the estimation error and determine the smallest necessary
memory usage (i.e., the number of registers used for a MaxLogHash sketch) for
the desired accuracy. We conduct experiments on a variety of datasets, and
experimental results show that our method MaxLogHash is about 5 times more
memory efficient than MinHash with the same accuracy and computational cost for
estimating high similarities
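For reference, the classic MinHash baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration of plain MinHash, not of MaxLogHash, b-bit MinHash, or Odd Sketch; the function names, the use of BLAKE2b as the hash family, and the default of 256 hash functions are assumptions made for the example.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """64-bit hash of `item` under the hash function indexed by `seed`."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, k=256):
    """Return k minimum hash values, one per seeded hash function.

    Assumes `items` is a non-empty collection of strings.
    """
    items = list(items)
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Each signature position stores a full 64-bit minimum here, which is the memory cost that the compressed sketches discussed in the abstract aim to reduce.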
Accurate Liability Estimation Improves Power in Ascertained Case Control Studies
Linear mixed models (LMMs) have emerged as the method of choice for
confounded genome-wide association studies. However, the performance of LMMs in
non-randomly ascertained case-control studies deteriorates with increasing
sample size. We propose a framework called LEAP (Liability Estimator As a
Phenotype, https://github.com/omerwe/LEAP) that tests for association with
estimated latent values corresponding to severity of phenotype, and demonstrate
that this can lead to a substantial power increase
Search Efficient Binary Network Embedding
Traditional network embedding primarily focuses on learning a dense vector
representation for each node, which encodes network structure and/or node
content information, such that off-the-shelf machine learning algorithms can be
easily applied to the vector-format node representations for network analysis.
However, the learned dense vector representations are inefficient for
large-scale similarity search, which requires finding the nearest neighbor
measured by Euclidean distance in a continuous vector space. In this paper, we
propose a search efficient binary network embedding algorithm called BinaryNE
to learn a sparse binary code for each node, by simultaneously modeling node
context relations and node attribute relations through a three-layer neural
network. BinaryNE learns binary node representations efficiently through a
stochastic gradient descent based online learning algorithm. The learned binary
encoding not only reduces memory usage to represent each node, but also allows
fast bit-wise comparisons to support much quicker network node search compared
to Euclidean distance or other distance measures. Our experiments and
comparisons show that BinaryNE not only delivers more than 23 times faster
search speed, but also provides comparable or better search quality than
traditional continuous vector based network embedding methods
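The speedup from bit-wise comparison can be illustrated with a small sketch. It assumes each node's binary code is packed into a Python integer; it shows only the Hamming-distance search step, not how BinaryNE learns the codes, and the names `hamming` and `nearest` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes packed as ints."""
    return bin(a ^ b).count("1")


def nearest(query: int, codes: dict) -> str:
    """Return the key in `codes` whose code is closest to `query` in Hamming distance.

    `codes` maps a node identifier to its packed binary code.
    """
    return min(codes, key=lambda node: hamming(query, codes[node]))


# Example: two nodes with 4-bit codes.
codes = {"u": 0b1010, "v": 0b0100}
closest = nearest(0b1011, codes)
```

One XOR and one bit count replace the per-dimension floating-point arithmetic that a Euclidean distance computation would need.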
Long-lasting, kin-directed female interactions in a spatially structured wild boar social network
We thank W. Jędrzejewski for his support and logistical help in trapping wild boar. We are grateful to R. Kozak, A. Waszkiewicz and many students and volunteers for their help with fieldwork as well as to A. N. Bunevich, T. Borowik and local hunters for providing genetic samples. Genetic analyses were performed in the laboratory of the Department of Science for Nature and Environmental Resources, University of Sassari, Italy, with the help of L. Iacolina and D. Biosa. We are grateful to K. O’Mahony who revised the English and to A. Widdig, K. Langergraber and one anonymous reviewer for valuable comments on the earlier version of the manuscript.
Peer reviewed
Publisher PD
Detection of regulator genes and eQTLs in gene networks
Genetic differences between individuals associated with quantitative phenotypic
traits, including disease states, are usually found in non-coding genomic
regions. These genetic variants are often also associated with differences in
expression levels of nearby genes (they are "expression quantitative trait
loci" or eQTLs for short) and presumably play a gene regulatory role, affecting
the status of molecular networks of interacting genes, proteins and
metabolites. Computational systems biology approaches to reconstruct causal
gene networks from large-scale omics data have therefore become essential to
understand the structure of networks controlled by eQTLs together with other
regulatory genes, and to generate detailed hypotheses about the molecular
mechanisms that lead from genotype to phenotype. Here we review the main
analytical methods and software to identify eQTLs and their associated genes,
to reconstruct co-expression networks and modules, to reconstruct causal
Bayesian gene and module networks, and to validate predicted networks in
silico.
Comment: minor revision with typos corrected; review article; 24 pages, 2
figure
Improved Densification of One Permutation Hashing
The existing work on densification of one permutation hashing reduces the
query processing cost of the $(K, L)$-parameterized Locality Sensitive Hashing
(LSH) algorithm with minwise hashing, from $O(dKL)$ to merely $O(d + KL)$,
where $d$ is the number of nonzeros of the data vector, $K$ is the number of
hashes in each hash table, and $L$ is the number of hash tables. While that is
a substantial improvement, our analysis reveals that the existing densification
scheme is sub-optimal. In particular, there is not enough randomness in that
procedure, which affects its accuracy on very sparse datasets.
In this paper, we provide a new densification procedure which is provably
better than the existing scheme. This improvement is more significant for very
sparse datasets which are common over the web. The improved technique has the
same cost of $O(d + KL)$ for query processing, thereby making it strictly
preferable over the existing procedure. Experimental evaluations on public
datasets, in the task of hashing based near neighbor search, support our
theoretical findings
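To make the idea of densification concrete, here is a simplified sketch of one permutation hashing with direction-randomized borrowing. It is an illustration under stated assumptions, not the authors' exact procedure: a single hash splits the range into `k` bins, each bin keeps its minimum offset, and every empty bin copies the value of the nearest non-empty bin to its left or right according to a random direction shared by all sets. All names, the BLAKE2b hash, and the default parameters are hypothetical.

```python
import hashlib
import random

EMPTY = None


def _h(item, universe):
    """Map an item to an integer in [0, universe)."""
    d = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(d, "big") % universe


def oph_sketch(items, k=64, universe=1 << 20):
    """One hash pass: keep the minimum offset seen in each of k equal bins."""
    if universe % k:
        raise ValueError("universe must be divisible by k")
    bin_width = universe // k
    bins = [EMPTY] * k
    for x in items:
        j, offset = divmod(_h(x, universe), bin_width)
        if bins[j] is EMPTY or offset < bins[j]:
            bins[j] = offset
    return bins


def densify(bins, bin_width, seed=0):
    """Fill empty bins from the nearest non-empty bin in a random direction.

    The per-bin directions depend only on `seed`, so every set uses the same ones.
    Adding t * bin_width keeps borrowed values distinct from native ones.
    """
    k = len(bins)
    if all(b is EMPTY for b in bins):
        raise ValueError("cannot densify the sketch of an empty set")
    rng = random.Random(seed)
    directions = [rng.choice((-1, 1)) for _ in range(k)]
    dense = []
    for j in range(k):
        t = 0
        while bins[(j + t * directions[j]) % k] is EMPTY:
            t += 1  # walk circularly until a non-empty bin is found
        dense.append(bins[(j + t * directions[j]) % k] + t * bin_width)
    return dense


def dense_sketch(items, k=64, universe=1 << 20, seed=0):
    """Convenience wrapper: sketch, then densify."""
    return densify(oph_sketch(items, k, universe), universe // k, seed)


def estimate_similarity(s1, s2):
    """Fraction of bins on which two densified sketches agree."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)
```

Because the sketch needs one hash evaluation per nonzero plus one pass over the bins, its cost grows additively with the set size and the number of bins rather than with their product.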