Maximally Consistent Sampling and the Jaccard Index of Probability Distributions
We introduce simple, efficient algorithms for computing a MinHash of a
probability distribution, suitable for both sparse and dense data, with
equivalent running times to the state of the art for both cases. The collision
probability of these algorithms is a new measure of the similarity of positive
vectors which we investigate in detail. We describe the sense in which this
collision probability is optimal for any Locality Sensitive Hash based on
sampling. We argue that this similarity measure is more useful for probability
distributions than the similarity pursued by other algorithms for weighted
MinHash, and is the natural generalization of the Jaccard index. Comment: To appear in ICDMW 201
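For orientation, the classical set-based MinHash and the Jaccard index that the abstract above refers to can be sketched as follows. This is a minimal illustration of the standard technique for plain sets, not the algorithm proposed in the paper; the function names and the use of salted BLAKE2b digests as a hash family are assumptions made for this example.

```python
import hashlib


def _salted_hash(item, i):
    """Hypothetical helper: the i-th hash function, built by salting BLAKE2b."""
    digest = hashlib.blake2b(
        repr(item).encode("utf-8"),
        digest_size=8,
        salt=i.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=512):
    """For each hash function, keep the minimum hash over a non-empty set."""
    return [min(_salted_hash(x, i) for x in items) for i in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that collide."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = set(range(0, 100))
    b = set(range(50, 150))
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

For each hash function, the probability that two sets share the same minimum equals their Jaccard index, so the collision rate across many hash functions estimates it.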
Hashing for Similarity Search: A Survey
Similarity search (nearest neighbor search) is the problem of finding the data
items in a large database whose distances to a query item are the smallest.
Various methods have been developed to address this problem, and recently much
effort has been devoted to approximate search. In this paper, we present a
survey of one of the main solutions, hashing, which has been widely studied
since the pioneering work on locality sensitive hashing. We divide the hashing
algorithms into two main categories: locality sensitive hashing, which designs
hash functions without exploring the data distribution, and learning to hash,
which learns hash functions according to the data distribution. We review them
from various aspects, including hash function design, distance measure, and
search scheme in the hash coding space.
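As an illustration of the first category (hash functions designed without looking at the data), random-hyperplane hashing for cosine similarity can be sketched as below. This is one well-known data-independent scheme chosen for the example, not a method taken from the survey; the function names, the seed, and the bit count are assumptions.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian normal vectors, independent of any data."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming(code_a, code_b):
    """Number of differing bits; smaller values suggest a smaller angle."""
    return sum(x != y for x, y in zip(code_a, code_b))


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, num_bits=64)
    query = [1.0, 0.0, 0.0, 0.0]
    near = [1.0, 0.1, 0.0, 0.0]
    far = [-1.0, 0.0, 0.0, 0.0]
    print("near:", hamming(hash_code(query, planes), hash_code(near, planes)))
    print("far: ", hamming(hash_code(query, planes), hash_code(far, planes)))
```

Searching in the hash coding space then amounts to comparing short binary codes by Hamming distance instead of comparing the original vectors.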
Stability of Mine Car Motion in Curves of Invariable and Variable Radii
We discuss our experiences adapting three recent algorithms for maximum common (connected) subgraph problems to exploit multi-core parallelism. These algorithms do not easily lend themselves to parallel search, as the search trees are extremely irregular, making balanced work distribution hard, and runtimes are very sensitive to value-ordering heuristic behaviour. Nonetheless, our results show that each algorithm can be parallelised successfully, with the threaded algorithms we create being clearly better than the sequential ones. We then look in more detail at the results, and discuss how speedups should be measured for this kind of algorithm. Because of the difficulty in quantifying an average speedup when so-called anomalous speedups (superlinear and sublinear) are common, we propose a new measure called aggregate speedup
Prediction of Large Events on a Dynamical Model of a Fault
We present results for long term and intermediate term prediction algorithms
applied to a simple mechanical model of a fault. We use long term prediction
methods based, for example, on the distribution of repeat times between large
events to establish a benchmark for predictability in the model. In comparison,
intermediate term prediction techniques, analogous to the pattern recognition
algorithms CN and M8 introduced and studied by Keilis-Borok et al., are more
effective at predicting coming large events. We consider the implications of
several different quality functions Q which can be used to optimize the
algorithms with respect to features such as space, time, and magnitude windows,
and find that our results are not overly sensitive to variations in these
algorithm parameters. We also study the intrinsic uncertainties associated with
seismicity catalogs of restricted lengths. Comment: 33 pages, plain.tex with special macros include
Inducing safer oblique trees without costs
Decision tree induction has been widely studied and applied. In safety applications, such as determining whether a chemical process is safe or whether a person has a medical condition, the cost of misclassification in one of the classes is significantly higher than in the other class. Several authors have tackled this problem by developing cost-sensitive decision tree learning algorithms or have suggested ways of changing the distribution of training examples to bias the decision tree learning process so as to take account of costs. A prerequisite for applying such algorithms is the availability of costs of misclassification.
Although this may be possible for some applications, obtaining reasonable estimates of costs of misclassification is not easy in the area of safety.
This paper presents a new algorithm for applications where the cost of misclassifications cannot be quantified, although the cost of misclassification in one class is known to be significantly higher than in another class. The algorithm utilizes linear discriminant analysis to identify oblique relationships between continuous attributes and then carries out an appropriate modification to ensure that the resulting tree errs on the side of safety. The algorithm is evaluated with respect to one of the best known cost-sensitive algorithms (ICET), a well-known oblique decision tree algorithm (OC1) and an algorithm that utilizes robust linear programming