Maximally Consistent Sampling and the Jaccard Index of Probability Distributions

Author: Jiang Yunjiang
Moulton Ryan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 24/10/2018

We introduce simple, efficient algorithms for computing a MinHash of a probability distribution, suitable for both sparse and dense data, with equivalent running times to the state of the art for both cases. The collision probability of these algorithms is a new measure of the similarity of positive vectors which we investigate in detail. We describe the sense in which this collision probability is optimal for any Locality Sensitive Hash based on sampling. We argue that this similarity measure is more useful for probability distributions than the similarity pursued by other algorithms for weighted MinHash, and is the natural generalization of the Jaccard index. Comment: To appear in ICDMW 201

arXiv.org e-Print Archive

Hashing for Similarity Search: A Survey

Author: Ji Jianqiu
Shen Heng Tao
Song Jingkuan
Wang Jingdong
Publication date: 13/08/2014

Similarity search (nearest neighbor search) is the problem of finding the data items in a large database whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently much effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.

arXiv.org e-Print Archive

CiteSeerX

Stability of Mine Car Motion in Curves of Invariable and Variable Radii

Author: A Bruin de
A Malapert
C Combier
C Luo
C McCreesh
D Conte
D Gao
DA Bader
DJ Cook
ED Dolan
G Chu
G Levi
G Li
H Bunke
HC Ehrlich
JW Raymond
L Kotthoff
M Fang
M Fernández
MD Santo
P Vismara
PS Segundo
S Gay
SN Ndiaye
T Delavallade
T Lai
T Petit
YH Park
Publication date: 01/01/2013

We discuss our experiences adapting three recent algorithms for maximum common (connected) subgraph problems to exploit multi-core parallelism. These algorithms do not easily lend themselves to parallel search, as the search trees are extremely irregular, making balanced work distribution hard, and runtimes are very sensitive to value-ordering heuristic behaviour. Nonetheless, our results show that each algorithm can be parallelised successfully, with the threaded algorithms we create being clearly better than the sequential ones. We then look in more detail at the results, and discuss how speedups should be measured for this kind of algorithm. Because of the difficulty in quantifying an average speedup when so-called anomalous speedups (superlinear and sublinear) are common, we propose a new measure called aggregate speedup

eLibrary National Mining University

Enlighten

Hal-Diderot

Prediction of Large Events on a Dynamical Model of a Fault

Author: Aki
Archuletta
B. E. Shaw
Bak
Bakun
Burridge
Carlson
Chen
Davis
Davison
Dieterich
Gabrielov
Healy
J. M. Carlson
Kanamori
Keilis-Borok
Kiremidjian
Langer
Malin
McNalley
Mogi
Molchan
Nishenko
Pacheco
S. L. Pepke
Scholz
Schwartz
Shaw
Shimazaki
Sykes
Thatcher
Vasconcelos
Ward
Wesnousky
Working Group on California Earthquake Prediction
Working Group on California Earthquake Prediction (WG-CEP)
Wyss
Publication venue: 'American Geophysical Union (AGU)'
Publication date: 10/08/1993

We present results for long term and intermediate term prediction algorithms applied to a simple mechanical model of a fault. We use long term prediction methods based, for example, on the distribution of repeat times between large events to establish a benchmark for predictability in the model. In comparison, intermediate term prediction techniques, analogous to the pattern recognition algorithms CN and M8 introduced and studied by Keilis-Borok et al., are more effective at predicting coming large events. We consider the implications of several different quality functions Q which can be used to optimize the algorithms with respect to features such as space, time, and magnitude windows, and find that our results are not overly sensitive to variations in these algorithm parameters. We also study the intrinsic uncertainties associated with seismicity catalogs of restricted lengths. Comment: 33 pages, plain.tex with special macros include

arXiv.org e-Print Archive

Inducing safer oblique trees without costs

Author: Althoff K.
Bennett K.P.
Berry M.
Blake C.
Bradford J.
Breiman L.
Cohen R.
Domingos P.
Elkan C.
Elomaa T.
Fan W.
Grefenstette J.
Knoll U.
Kolodner J.
Morrison D.
Norusis M.
Nunez M.
Pazzani M.
Provost F.J.
Quinlan J.R.
Sunil Vadera
Tan M.
Ting K.
Turney P.
Vadera S.
Publication venue: 'Wiley'
Publication date: 01/09/2005

Decision tree induction has been widely studied and applied. In safety applications, such as determining whether a chemical process is safe or whether a person has a medical condition, the cost of misclassification in one of the classes is significantly higher than in the other class. Several authors have tackled this problem by developing cost-sensitive decision tree learning algorithms or have suggested ways of changing the distribution of training examples to bias the decision tree learning process so as to take account of costs. A prerequisite for applying such algorithms is the availability of costs of misclassification. Although this may be possible for some applications, obtaining reasonable estimates of costs of misclassification is not easy in the area of safety. This paper presents a new algorithm for applications where the cost of misclassifications cannot be quantified, although the cost of misclassification in one class is known to be significantly higher than in another class. The algorithm utilizes linear discriminant analysis to identify oblique relationships between continuous attributes and then carries out an appropriate modification to ensure that the resulting tree errs on the side of safety. The algorithm is evaluated with respect to one of the best known cost-sensitive algorithms (ICET), a well-known oblique decision tree algorithm (OC1) and an algorithm that utilizes robust linear programming

University of Salford Institutional Repository