
    Indexing the Earth Mover's Distance Using Normal Distributions

    Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulty of comparing uncertainty between distributions. The Earth Mover's Distance (EMD) has increasingly been employed to compare uncertain data because it effectively captures the differences between two distributions. Computing the EMD entails solving the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure that significantly improve the performance of EMD-based K-nearest neighbor (K-NN) queries on uncertain databases. The lower bound approximates the EMD on a projection vector: each distribution is projected onto the vector and approximated by a normal distribution together with an accompanying error term. Each normal is then represented as a point in a Hough-transformed space, and the concept of stochastic dominance is used to implement an efficient index structure in the transformed space. We show that our method significantly decreases K-NN query time on uncertain databases. The index structure also scales well with database cardinality and is well suited for heterogeneous data sets, helping to keep EMD-based queries tractable as uncertain data sets become larger and more complex.
    Comment: VLDB201
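    The abstract's core idea, projecting each distribution onto a vector and summarizing the projection by a normal distribution, can be illustrated with a small sketch. The proxy distance below uses the closed-form 1D 2-Wasserstein distance between the fitted normals; the paper's actual lower bound, error term, and Hough-space index are not reproduced here, and the data and function names are illustrative placeholders.

```python
# Hypothetical sketch of the projection-and-normal-approximation idea: project
# each weighted point set onto a vector, summarize the 1D projection by a
# normal distribution, and compare the summaries cheaply.
import numpy as np

def project_and_fit(points, weights, direction):
    """Project a weighted point set onto `direction` and fit a 1D normal."""
    proj = points @ direction               # 1D projection of each support point
    mean = np.average(proj, weights=weights)
    var = np.average((proj - mean) ** 2, weights=weights)
    return mean, np.sqrt(var)

def normal_proxy_distance(p, wp, q, wq, direction):
    """Cheap proxy distance between two distributions via their projected normals.
    For 1D normals, the 2-Wasserstein distance is sqrt((m1-m2)^2 + (s1-s2)^2)."""
    m1, s1 = project_and_fit(p, wp, direction)
    m2, s2 = project_and_fit(q, wq, direction)
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

# Usage: two small 2D distributions given as support points plus weights.
rng = np.random.default_rng(0)
p, q = rng.random((5, 2)), rng.random((5, 2))
wp = wq = np.full(5, 0.2)
d = np.array([1.0, 0.0])                    # projection vector
print(normal_proxy_distance(p, wp, q, wq, d))
```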

    Dynamic Ordered Sets with Exponential Search Trees

    We introduce exponential search trees as a novel technique for converting static polynomial-space search structures for ordered sets into fully-dynamic linear-space data structures. This leads to an optimal bound of O(sqrt(log n/loglog n)) for searching and updating a dynamic set of n integer keys in linear space. Here searching for an integer y means finding the maximum key in the set that is smaller than or equal to y. This problem is equivalent to the standard textbook problem of maintaining an ordered set (see, e.g., Cormen, Leiserson, Rivest, and Stein: Introduction to Algorithms, 2nd ed., MIT Press, 2001). The best previous deterministic linear-space bound was O(log n/loglog n), due to Fredman and Willard from STOC 1990. No better deterministic search bound was known using polynomial space. We also obtain the following worst-case linear-space trade-offs between the number of keys n, the word length w, and the maximal key U < 2^w: O(min{loglog n + log n/log w, (loglog n)(loglog U)/(logloglog U)}). These trade-offs are, however, not likely to be optimal. Our results are generalized to finger searching and string searching, providing optimal results for both in terms of n.
    Comment: Revision corrects some typos and states things better for applications in a subsequent paper
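    A minimal sketch of the query the abstract defines (finding the maximum key smaller than or equal to y), using a plain sorted array and binary search rather than the exponential search tree itself, which is a far more involved dynamic structure:

```python
# Predecessor query on a static sorted array; illustrates the search problem
# only, not the exponential search tree data structure.
import bisect

def predecessor(sorted_keys, y):
    """Return the largest key <= y, or None if every key exceeds y."""
    i = bisect.bisect_right(sorted_keys, y)
    return sorted_keys[i - 1] if i > 0 else None

keys = [3, 8, 15, 42, 99]
print(predecessor(keys, 40))   # -> 15
print(predecessor(keys, 2))    # -> None
```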

    Stock Picking via Nonsymmetrically Pruned Binary Decision Trees

    Stock picking is a field of financial analysis that is of particular interest to many professional investors and researchers. In this study stock picking is implemented via binary classification trees. Optimal tree size is believed to be the crucial factor in the forecasting performance of the trees. While there exists a standard method of tree pruning, based on the cost-complexity tradeoff and used in the majority of studies employing binary decision trees, this paper introduces a novel methodology of nonsymmetric tree pruning called the Best Node Strategy (BNS). An important property of BNS is proven, which provides a straightforward way to search for the optimal tree size in practice. BNS is compared with the traditional pruning approach by composing two recursive portfolios out of XETRA DAX stocks, with performance forecasts for each stock provided by the constructed decision trees. It is shown that BNS clearly outperforms the traditional approach according to the backtesting results and the Diebold-Mariano test for statistical significance of the performance difference between the two forecasting methods.
    Keywords: decision tree, stock picking, pruning, earnings forecasting, data mining
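    For orientation, the sketch below runs the standard cost-complexity pruning that the paper uses as its baseline, via scikit-learn's ccp_alpha mechanism; it does not implement the proposed Best Node Strategy, and the data is a synthetic placeholder rather than the XETRA DAX set.

```python
# Baseline cost-complexity pruning: enumerate the pruning path and refit a
# pruned tree for each candidate alpha.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # placeholder feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # placeholder binary target

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}")
```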

    Full Elite Sets for Multi-Objective Optimisation

    Copyright © 2002 Springer. The final publication is available at link.springer.com.
    5th International Conference on Adaptive Computing in Design and Manufacture (ACDM 2002), Exeter, UK, 16-18 April 2002.
    Multi-objective evolutionary algorithms frequently use an archive of non-dominated solutions to approximate the Pareto front. We show that the truncation of this archive to a limited number of solutions can lead to oscillating and shrinking estimates of the Pareto front. New data structures to permit efficient query and update of the full archive are proposed, and the superior quality of frontal estimates found using the full archive is illustrated on test problems.
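    A naive sketch of the full (untruncated) archive idea, keeping every non-dominated solution found so far; the paper's contribution is an efficient query/update structure, whereas this list-based version only illustrates the dominance logic for minimisation on all objectives:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimisation)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def archive_insert(archive, candidate):
    """Insert candidate unless dominated; drop archive members it dominates."""
    if any(dominates(member, candidate) for member in archive):
        return archive
    return [m for m in archive if not dominates(candidate, m)] + [candidate]

archive = []
for point in [(3, 4), (2, 5), (1, 6), (2, 4), (5, 1)]:
    archive = archive_insert(archive, point)
print(archive)   # only mutually non-dominated points remain
```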

    Optimising decision trees using multi-objective particle swarm optimisation

    Copyright © 2009 Springer-Verlag Berlin Heidelberg. The final publication is available at link.springer.com.
    Book title: Swarm Intelligence for Multi-objective Problems in Data Mining
    Summary: Although conceptually quite simple, decision trees are still among the most popular classifiers applied to real-world problems. Their popularity is due to a number of factors, core among these being their ease of comprehension, robust performance, and fast data-processing capabilities. Additionally, feature selection is implicit within the decision tree structure. This chapter introduces the basic ideas behind decision trees, focusing on decision trees which only consider a rule relating to a single feature at a node (therefore making recursive axis-parallel slices in feature space to form their classification boundaries). The use of particle swarm optimization (PSO) to train near-optimal decision trees is discussed, and PSO is applied both in a single-objective formulation (minimizing misclassification cost) and a multi-objective formulation (trading off misclassification rates across classes). Empirical results are presented on popular classification data sets from the well-known UCI machine learning repository, and PSO is demonstrated to be fully capable of acting as an optimizer for trees on these problems. Results additionally support the argument that multi-objectification of a problem can improve uni-objective search in classification problems.
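    The two ingredients the chapter combines, an axis-parallel rule on a single feature and PSO as the optimizer, can be illustrated with a toy sketch: a swarm tunes the threshold of a single split to minimize misclassification. This is not the chapter's tree-induction procedure, and the PSO parameters below are conventional defaults rather than the chapter's settings.

```python
# Toy PSO over the threshold of a single axis-parallel split.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
y = (X[:, 0] > 0.3).astype(int)          # placeholder labelled data

def misclassification(threshold):
    """Error rate of the single axis-parallel rule x[0] > threshold."""
    pred = (X[:, 0] > threshold).astype(int)
    return np.mean(pred != y)

# Minimal PSO: inertia + cognitive + social velocity terms.
n, w, c1, c2 = 20, 0.7, 1.5, 1.5
pos = rng.uniform(-2, 2, n)
vel = np.zeros(n)
pbest = pos.copy()
pbest_f = np.array([misclassification(p) for p in pos])
gbest = pbest[np.argmin(pbest_f)]

for _ in range(50):
    r1, r2 = rng.random(n), rng.random(n)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    f = np.array([misclassification(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[np.argmin(pbest_f)]

print(f"best threshold ~ {gbest:.3f}, error = {misclassification(gbest):.3f}")
```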