Search CORE

13,564 research outputs found

Finding Associations and Computing Similarity via Biased Pair Sampling

Author: Campagna Andrea
Pagh Rasmus
Publication venue
Publication date: 01/01/2009
Field of study

This version is ***superseded*** by a full version that can be found at http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger theoretical results and fixes a mistake in the reporting of experiments. Abstract: Sampling-based methods have previously been proposed for the problem of finding interesting associations in data, even for low-support items. While these methods do not guarantee precise results, they can be vastly more efficient than approaches that rely on exact counting. However, for many similarity measures no such methods have been known. In this paper we show how a wide variety of measures can be supported by a simple biased sampling method. The method also extends to find high-confidence association rules. We demonstrate theoretically that our method is superior to exact methods when the threshold for "interesting similarity/confidence" is above the average pairwise similarity/confidence, and the average support is not too low. Our method is particularly good when transactions contain many items. We confirm in experiments on standard association mining benchmarks that this gives a significant speedup on real data sets (sometimes much larger than the theoretical guarantees). Reductions in computation time of over an order of magnitude, and significant savings in space, are observed.Comment: This is an extended version of a paper that appeared at the IEEE International Conference on Data Mining, 2009. The conference version is (c) 2009 IEE

arXiv.org e-Print Archive

CiteSeerX

The IT University of Copenhagen's Repository

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

Author: Broder A.
Durand Marianne
Flajolet Philippe
Li Ping
Li Ping
Shrivastava Anshumali
Shrivastava Anshumali
Shrivastava Anshumali
Shrivastava Anshumali
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 22/05/2019
Field of study

Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, of which elements are given in a streaming fashion and cardinality is unknown or even infinite, unfortunately, b-bit MinHash and Odd Sketch fail to deal with streaming data. To solve this problem, we design a memory efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller sized registers (each register consists of less than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and experimental results show that our method MaxLogHash is about 5 times more memory efficient than MinHash with the same accuracy and computational cost for estimating high similarities

arXiv.org e-Print Archive

Crossref

Accurate Liability Estimation Improves Power in Ascertained Case Control Studies

Author: AL Price
AL Price
C Lippert
C Widmer
Christoph Lippert
D Golan
D Welter
Dan Geiger
David Heckerman
DJ Balding
ER Dempster
J Listgarten
J Yang
J Yang
J Yang
LA Hindorff
LC Tsoi
M Fakiola
N Fusi
N Patterson
N Zaitlen
N Zaitlen
Omer Weissbrod
S Sawcer
S Wright
SH Lee
X Zhou
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2015
Field of study

Linear mixed models (LMMs) have emerged as the method of choice for confounded genome-wide association studies. However, the performance of LMMs in non-randomly ascertained case-control studies deteriorates with increasing sample size. We propose a framework called LEAP (Liability Estimator As a Phenotype, https://github.com/omerwe/LEAP) that tests for association with estimated latent values corresponding to severity of phenotype, and demonstrate that this can lead to a substantial power increase

arXiv.org e-Print Archive

Crossref

MDC Repository

Search Efficient Binary Network Embedding

Author: Yin Jie
Zhang Chengqi
Zhang Daokun
Zhu Xingquan
Publication venue
Publication date: 13/01/2019
Field of study

Traditional network embedding primarily focuses on learning a dense vector representation for each node, which encodes network structure and/or node content information, such that off-the-shelf machine learning algorithms can be easily applied to the vector-format node representations for network analysis. However, the learned dense vector representations are inefficient for large-scale similarity search, which requires to find the nearest neighbor measured by Euclidean distance in a continuous vector space. In this paper, we propose a search efficient binary network embedding algorithm called BinaryNE to learn a sparse binary code for each node, by simultaneously modeling node context relations and node attribute relations through a three-layer neural network. BinaryNE learns binary node representations efficiently through a stochastic gradient descent based online learning algorithm. The learned binary encoding not only reduces memory usage to represent each node, but also allows fast bit-wise comparisons to support much quicker network node search compared to Euclidean distance or other distance measures. Our experiments and comparisons show that BinaryNE not only delivers more than 23 times faster search speed, but also provides comparable or better search quality than traditional continuous vector based network embedding methods

arXiv.org e-Print Archive

Long-lasting, kin-directed female interactions in a spatially structured wild boar social network

Author: Jędrzejewska Bogumiła
Lusseau David
Podgórski Tomasz
Scandura Massimo
Sönnichsen Leif
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/06/2014
Field of study

We thank W. Jędrzejewski for his support and logistical help in trapping wild boar. We are grateful to R. Kozak, A. Waszkiewicz and many students and volunteers for their help with fieldwork as well as to A. N. Bunevich, T. Borowik and local hunters for providing genetic samples. Genetic analyses were performed in the laboratory of the Department of Science for Nature and Environmental Resources, University of Sassari, Italy, with the help of L. Iacolina and D. Biosa. We are grateful to K. O’Mahony who revised English and to A. Widdig, K. Langergraber and one anonymous reviewer for valuable comments on the earlier version of the manuscript.Peer reviewedPublisher PD

Aberdeen University Research

Crossref

Directory of Open Access Journals

PubMed Central

UnissResearch

FigShare

Detection of regulator genes and eQTLs in gene networks

Author: A Butte
A Chatr-Aryamontri
A Clauset
A Joshi
A Joshi
A Kundaje
AA Shabalin
AJ Enright
AJ Walhout
AS Dimas
B Schwanhausser
B Zhang
B Zhang
C Cenik
CO Daub
D Koller
DA Cusanovich
DM Greenawalt
E Bonnet
E Ravasz
E Segal
EC Neto
EC Neto
EC Neto
EE Schadt
EE Schadt
EE Schadt
EE Schadt
EE Schadt
EJ Foss
F Grubert
F Yue
FA Cubillos
FW Albert
G Hemani
G Nicholson
GD Smith
GH Golub
H Foroughi Asl
H Talukdar
HN Kadarmideen
J Millstein
J Qi
J Zhu
J Zhu
J Zhu
JE Aten
JF Ayroles
JJ Faith
JL Björkegren
JS Liu
K Basso
K Qu
KG Ardlie
L Wu
LA Hindorff
LH Hartwell
LS Chen
M Ashburner
M Civelek
M Georges
M Gerstein
M Medvedovic
M Schmidt
M Scutari
MA Schaub
MB Eisen
MD Ritchie
ME Goddard
MEJ Newman
MEJ Newman
MV Rockman
MV Rockman
N Friedman
N Friedman
N Friedman
N Laird
O Stegle
P Langfelder
P Langfelder
P Langfelder
P Lu
R Sharan
R Sharan
RB Brem
RW Williams
S Lee
S Roy
S Tavazoie
SI Lee
SM Waszak
SS Rao
T Lappalainen
T Michoel
TA Manolio
TF Mackay
The ENCODE
TS Furey
VG Cheung
W Cookson
W Zhang
Y Chen
Y Li
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/11/2016
Field of study

Genetic differences between individuals associated to quantitative phenotypic traits, including disease states, are usually found in non-coding genomic regions. These genetic variants are often also associated to differences in expression levels of nearby genes (they are "expression quantitative trait loci" or eQTLs for short) and presumably play a gene regulatory role, affecting the status of molecular networks of interacting genes, proteins and metabolites. Computational systems biology approaches to reconstruct causal gene networks from large-scale omics data have therefore become essential to understand the structure of networks controlled by eQTLs together with other regulatory genes, and to generate detailed hypotheses about the molecular mechanisms that lead from genotype to phenotype. Here we review the main analytical methods and softwares to identify eQTLs and their associated genes, to reconstruct co-expression networks and modules, to reconstruct causal Bayesian gene and module networks, and to validate predicted networks in silico.Comment: minor revision with typos corrected; review article; 24 pages, 2 figure

arXiv.org e-Print Archive

Crossref

Improved Densification of One Permutation Hashing

Author: Li Ping
Shrivastava Anshumali
Publication venue
Publication date: 18/06/2014
Field of study

The existing work on densification of one permutation hashing reduces the query processing cost of the

(K,L)

-parameterized Locality Sensitive Hashing (LSH) algorithm with minwise hashing, from

O(dKL)

to merely

O(d + KL)

, where

d

is the number of nonzeros of the data vector,

K

is the number of hashes in each hash table, and

L

is the number of hash tables. While that is a substantial improvement, our analysis reveals that the existing densification scheme is sub-optimal. In particular, there is no enough randomness in that procedure, which affects its accuracy on very sparse datasets. In this paper, we provide a new densification procedure which is provably better than the existing scheme. This improvement is more significant for very sparse datasets which are common over the web. The improved technique has the same cost of

O(d + KL)

for query processing, thereby making it strictly preferable over the existing procedure. Experimental evaluations on public datasets, in the task of hashing based near neighbor search, support our theoretical findings

arXiv.org e-Print Archive

CiteSeerX