
    Maximally Consistent Sampling and the Jaccard Index of Probability Distributions

    We introduce simple, efficient algorithms for computing a MinHash of a probability distribution, suitable for both sparse and dense data, with running times equivalent to the state of the art in both cases. The collision probability of these algorithms is a new measure of the similarity of positive vectors, which we investigate in detail. We describe the sense in which this collision probability is optimal for any Locality Sensitive Hash based on sampling. We argue that this similarity measure is more useful for probability distributions than the similarity pursued by other algorithms for weighted MinHash, and that it is the natural generalization of the Jaccard index. (Comment: To appear in ICDMW 201)
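
    The similarity measure at stake is the weighted (generalized) Jaccard index, J(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i). The Python sketch below computes it and contrasts it with the collision rate of a classic exponential-clock consistent sampler; this is only a stand-in for the general idea, not the paper's algorithms or their optimal collision probability, and all names are illustrative.

```python
import numpy as np

def weighted_jaccard(x, y):
    """Generalized Jaccard index for nonnegative vectors:
    sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

def exp_clock_minhash(x, rng):
    """One classic consistent sample: argmin_i e_i / x_i with shared
    exponentials e_i.  Identical inputs always collide; note this is
    NOT the paper's algorithm, just a simple stand-in."""
    x = np.asarray(x, float)
    e = rng.exponential(size=x.shape)
    with np.errstate(divide="ignore"):
        keys = np.where(x > 0, e / x, np.inf)
    return int(np.argmin(keys))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print("weighted Jaccard:", weighted_jaccard(p, q))

# Monte-Carlo collision rate of the toy sampler; seeding both calls
# identically shares the exponentials between the two distributions.
hits = sum(
    exp_clock_minhash(p, np.random.default_rng(s)) ==
    exp_clock_minhash(q, np.random.default_rng(s))
    for s in range(2000)
)
print("toy sampler collision rate:", hits / 2000)
```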

    Hashing for Similarity Search: A Survey

    Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are smallest. Various methods have been developed to address this problem, and recently a great deal of effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
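
    As a minimal illustration of the first category (data-independent locality sensitive hashing), here is a random-hyperplane SimHash sketch in Python; a learning-to-hash method would instead fit the hyperplanes to the data distribution. Names and parameters are illustrative, not taken from the survey.

```python
import numpy as np

def simhash_signature(v, planes):
    """Random-hyperplane LSH (SimHash): one bit per hyperplane.
    For each bit, collision probability is 1 - angle(u, v) / pi."""
    return (planes @ v > 0).astype(np.uint8)

rng = np.random.default_rng(42)
dim, n_bits = 64, 32
planes = rng.standard_normal((n_bits, dim))  # shared hash functions

u = rng.standard_normal(dim)
v = u + 0.1 * rng.standard_normal(dim)   # a near neighbour of u
w = rng.standard_normal(dim)             # an unrelated vector

sig_u, sig_v, sig_w = (simhash_signature(x, planes) for x in (u, v, w))
print("bits matching u,v:", int((sig_u == sig_v).sum()), "/", n_bits)
print("bits matching u,w:", int((sig_u == sig_w).sum()), "/", n_bits)
```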

    Stability of Mine Car Motion in Curves of Invariable and Variable Radii

    We discuss our experiences adapting three recent algorithms for maximum common (connected) subgraph problems to exploit multi-core parallelism. These algorithms do not lend themselves easily to parallel search: the search trees are extremely irregular, making balanced work distribution hard, and runtimes are very sensitive to the behaviour of the value-ordering heuristic. Nonetheless, our results show that each algorithm can be parallelised successfully, with the threaded algorithms we create clearly outperforming the sequential ones. We then look at the results in more detail and discuss how speedups should be measured for this kind of algorithm. Because it is difficult to quantify an average speedup when so-called anomalous speedups (superlinear and sublinear) are common, we propose a new measure called aggregate speedup.
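
    The abstract does not define aggregate speedup precisely; one plausible reading, assumed in this sketch, is the ratio of total sequential to total parallel runtime over a benchmark set, which keeps tiny anomalous instances from distorting the summary. The runtimes below are hypothetical.

```python
# Hedged sketch: "aggregate speedup" is ASSUMED here to mean
# (sum of sequential runtimes) / (sum of parallel runtimes) over a
# benchmark set; the paper's precise definition may differ.
seq_times = [12.0, 950.0, 0.4, 3300.0]   # hypothetical per-instance runtimes (s)
par_times = [3.5, 310.0, 0.9, 410.0]     # same instances, threaded solver

per_instance = [s / p for s, p in zip(seq_times, par_times)]
aggregate = sum(seq_times) / sum(par_times)

print("per-instance speedups:", [round(x, 2) for x in per_instance])
print("aggregate speedup:    ", round(aggregate, 2))
# The sublinear instance (0.4 s -> 0.9 s) barely moves the aggregate,
# whereas it would visibly distort a naive mean of per-instance ratios.
```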

    Prediction of Large Events on a Dynamical Model of a Fault

    We present results for long-term and intermediate-term prediction algorithms applied to a simple mechanical model of a fault. We use long-term prediction methods, based for example on the distribution of repeat times between large events, to establish a benchmark for predictability in the model. In comparison, intermediate-term prediction techniques, analogous to the pattern recognition algorithms CN and M8 introduced and studied by Keilis-Borok et al., are more effective at predicting coming large events. We consider the implications of several different quality functions Q which can be used to optimize the algorithms with respect to features such as space, time, and magnitude windows, and find that our results are not overly sensitive to variations in these algorithm parameters. We also study the intrinsic uncertainties associated with seismicity catalogs of restricted lengths. (Comment: 33 pages, plain.tex with special macros included)
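
    As a sketch of the long-term, repeat-time idea (not the CN/M8-style intermediate-term algorithms), the following Python snippet estimates the conditional probability of a large event within a given horizon from an empirical sample of repeat times; the data and function names are hypothetical.

```python
import numpy as np

def conditional_event_prob(repeat_times, elapsed, horizon):
    """P(next large event within `horizon`, given `elapsed` time since
    the last one), estimated from an empirical repeat-time sample."""
    rt = np.asarray(repeat_times, float)
    at_risk = rt > elapsed                   # cycles that lasted this long
    if not at_risk.any():
        return float("nan")                  # no data this deep into a cycle
    return (rt[at_risk] <= elapsed + horizon).mean()

# Hypothetical repeat times between large events, e.g. from a catalog
# or from long runs of the fault model (units arbitrary).
rng = np.random.default_rng(1)
repeats = rng.lognormal(mean=4.0, sigma=0.5, size=500)

print(conditional_event_prob(repeats, elapsed=40.0, horizon=10.0))
print(conditional_event_prob(repeats, elapsed=80.0, horizon=10.0))
```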

    Inducing safer oblique trees without costs

    Decision tree induction has been widely studied and applied. In safety applications, such as determining whether a chemical process is safe or whether a person has a medical condition, the cost of misclassification in one of the classes is significantly higher than in the other. Several authors have tackled this problem by developing cost-sensitive decision tree learning algorithms, or have suggested ways of changing the distribution of training examples to bias the decision tree learning process so as to take account of costs. A prerequisite for applying such algorithms is the availability of misclassification costs. Although this may be possible for some applications, obtaining reasonable estimates of misclassification costs is not easy in the area of safety. This paper presents a new algorithm for applications where the cost of misclassification cannot be quantified, although the cost of misclassification in one class is known to be significantly higher than in another. The algorithm utilizes linear discriminant analysis to identify oblique relationships between continuous attributes and then carries out an appropriate modification to ensure that the resulting tree errs on the side of safety. The algorithm is evaluated with respect to one of the best known cost-sensitive algorithms (ICET), a well-known oblique decision tree algorithm (OC1), and an algorithm that utilizes robust linear programming.
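
    A minimal sketch of the idea as described: find an oblique direction with Fisher's linear discriminant, then bias the split threshold so that no unsafe training example lands on the safe side. This is an illustration under stated assumptions, not the paper's algorithm; all names are hypothetical.

```python
import numpy as np

def lda_direction(X, y):
    """Fisher linear discriminant direction for two classes
    (0 = safe, 1 = unsafe): w = Sw^{-1} (mu1 - mu0)."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    return np.linalg.solve(Sw, X1.mean(0) - X0.mean(0))

def safe_oblique_split(X, y):
    """Project onto the LDA direction, then place the threshold at the
    extreme of the unsafe class's projections, so every unsafe training
    example falls on the 'unsafe' side.  A sketch, not the paper's method."""
    w = lda_direction(X, y)
    z = X @ w
    thresh = z[y == 1].min()  # err on the side of safety
    return w, thresh

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2.5, 1, (30, 2))])
y = np.array([0] * 100 + [1] * 30)
w, t = safe_oblique_split(X, y)
pred_unsafe = (X @ w >= t)
print("all unsafe examples flagged:", bool(pred_unsafe[y == 1].all()))  # True by construction
```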