    DROP: Dimensionality Reduction Optimization for Time Series

    Dimensionality reduction (DR) is a critical step in scaling machine learning pipelines. Principal component analysis (PCA) is a standard tool for dimensionality reduction, but performing PCA over a full dataset can be prohibitively expensive. As a result, theoretical work has studied the effectiveness of iterative, stochastic PCA methods that operate over data samples. However, stochastic PCA methods typically terminate either after a predetermined number of iterations or upon convergence of the solution, frequently sampling too many or too few datapoints to yield end-to-end runtime improvements. We show how accounting for downstream analytics operations during DR via PCA allows stochastic methods to terminate efficiently after operating over small (e.g., 1%) subsamples of input data, reducing whole-workload runtime. Leveraging this insight, we propose DROP, a DR optimizer that achieves speedups of up to 5x over Singular-Value-Decomposition-based PCA techniques and exceeds conventional approaches such as FFT and PAA by up to 16x in end-to-end workloads.
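
    The core idea can be illustrated with a short sketch. The following is a minimal, hypothetical example, not the DROP implementation: it runs a batched stochastic PCA update over random subsamples and stops once a downstream proxy metric (here, relative reconstruction error on a small held-out sample) improves by less than a tolerance. Function names, batch sizes, and thresholds are all illustrative assumptions.

        # Minimal sketch: sample-based stochastic PCA with early termination
        # driven by a downstream proxy metric. Not the DROP implementation;
        # names and thresholds are illustrative.
        import numpy as np

        def stochastic_pca_early_stop(X, k=2, batch=256, tol=1e-3, max_passes=50, seed=0):
            rng = np.random.default_rng(seed)
            n, d = X.shape
            # Small held-out sample to estimate downstream (reconstruction) quality.
            val = X[rng.choice(n, size=min(512, n), replace=False)]
            W = np.linalg.qr(rng.standard_normal((d, k)))[0]  # random orthonormal init
            prev_err, seen = np.inf, 0
            for _ in range(max_passes):
                B = X[rng.choice(n, size=batch, replace=False)]
                # One batched power-iteration step (a block variant of Oja's rule).
                W, _ = np.linalg.qr(B.T @ (B @ W))
                seen += batch
                # Downstream proxy: how well does the current basis reconstruct val?
                err = np.linalg.norm(val - (val @ W) @ W.T) / np.linalg.norm(val)
                if prev_err - err < tol:  # marginal benefit too small: stop early
                    break
                prev_err = err
            return W, seen  # basis and number of rows actually consumed

        X = np.random.default_rng(1).standard_normal((10000, 50))
        W, used = stochastic_pca_early_stop(X)
        print(f"terminated after sampling {used} of {X.shape[0]} rows")

    The stopping rule is the point of the sketch: rather than iterating to numerical convergence, it stops as soon as additional samples no longer measurably help the downstream objective.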

    Distributed top-k aggregation queries at large

    Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially in distributed settings where the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments with three different real-life datasets, using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
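
    Since the optimizations target algorithms in the TPUT and KLEE frameworks, a compact sketch of the baseline TPUT three-phase scheme helps fix ideas. This is an in-memory toy under simplifying assumptions: nodes are modelled as plain dicts, there is no network layer, and all identifiers are illustrative.

        # Toy sketch of the TPUT three-phase distributed top-k scheme.
        from collections import defaultdict
        import heapq

        def tput_topk(nodes, k):
            m = len(nodes)
            sorted_nodes = [sorted(n.items(), key=lambda kv: -kv[1]) for n in nodes]

            # Phase 1: each node ships its local top-k; tau = kth partial sum.
            partial = defaultdict(float)
            for sn in sorted_nodes:
                for item, score in sn[:k]:
                    partial[item] += score
            tau = heapq.nlargest(k, partial.values())[-1]

            # Phase 2: uniform threshold tau/m; nodes ship items scoring >= tau/m.
            t = tau / m
            lower, seen_at = defaultdict(float), defaultdict(set)
            for i, sn in enumerate(sorted_nodes):
                for item, score in sn:
                    if score < t:
                        break  # lists are sorted, so nothing below can qualify
                    lower[item] += score
                    seen_at[item].add(i)
            # Each unseen node can contribute at most t to an item's total.
            upper = {it: lower[it] + t * (m - len(seen_at[it])) for it in lower}
            kth_lower = heapq.nlargest(k, lower.values())[-1]
            candidates = [it for it in lower if upper[it] >= kth_lower]

            # Phase 3: fetch exact scores only for the surviving candidates.
            exact = {it: sum(n.get(it, 0.0) for n in nodes) for it in candidates}
            return heapq.nlargest(k, exact.items(), key=lambda kv: kv[1])

        nodes = [{"a": 9, "b": 4, "c": 1}, {"a": 2, "b": 8, "d": 7}, {"c": 6, "d": 5, "b": 1}]
        print(tput_topk(nodes, k=2))  # [('b', 13.0), ('d', 12.0)]

    The paper's optimizations act precisely on the knobs visible here: how input lists are grouped into operator trees, how deep each list is scanned in phase two, and which sources are consulted at all.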

    Query processing of spatial objects: Complexity versus Redundancy

    The management of complex spatial objects in applications such as geography and cartography imposes stringent new requirements on spatial database systems, in particular on efficient query processing. As shown before, the performance of spatial query processing can be improved by decomposing complex spatial objects into simple components. Up to now, only decomposition techniques generating a linear number of very simple components, e.g. triangles or trapezoids, have been considered. In this paper, we investigate the natural trade-off between the complexity of the components and the redundancy, i.e. the number of components, with respect to their effect on efficient query processing. In particular, we present two new decomposition methods that achieve a better balance between the complexity and the number of components than previously known techniques. We compare these new decomposition methods to the traditional undecomposed representation as well as to the well-known decomposition into convex polygons with respect to their performance in spatial query processing. This comparison shows that over a wide range of query selectivities the new decomposition techniques clearly outperform both the undecomposed representation and the convex decomposition method. More important than the absolute performance gain of up to an order of magnitude is the robust performance of our new decomposition techniques over the whole range of query selectivities.
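
    To make the trade-off concrete, here is a toy sketch (not the paper's decomposition methods): a convex polygon is fan-triangulated into many trivially simple components, and a point query applies a cheap bounding-box filter before the exact per-component test. More components mean more redundancy, but each exact test becomes constant-time.

        # Toy illustration of complexity vs. redundancy in decomposition-based
        # query processing. Fan triangulation only works for convex polygons
        # and stands in for the paper's more general decomposition methods.

        def fan_triangulate(poly):
            """Split a convex polygon (CCW vertex list) into triangles."""
            return [(poly[0], poly[i], poly[i + 1]) for i in range(1, len(poly) - 1)]

        def bbox(tri):
            xs, ys = zip(*tri)
            return min(xs), min(ys), max(xs), max(ys)

        def in_triangle(p, tri):
            """p is inside iff the three cross products share a common sign."""
            def cross(a, b, c):
                return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
            s = [cross(tri[i], tri[(i + 1) % 3], p) for i in range(3)]
            return all(v >= 0 for v in s) or all(v <= 0 for v in s)

        def contains(p, components):
            for tri, (x0, y0, x1, y1) in components:
                if x0 <= p[0] <= x1 and y0 <= p[1] <= y1 and in_triangle(p, tri):
                    return True  # simple components keep each exact test O(1)
            return False

        square = [(0, 0), (4, 0), (4, 4), (0, 4)]
        comps = [(t, bbox(t)) for t in fan_triangulate(square)]
        print(contains((1, 3), comps), contains((5, 5), comps))  # True False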

    Developing a labelled object-relational constraint database architecture for the projection operator

    Current relational databases have been developed to improve the handling of stored data; however, some types of information that need to be analysed lack suitable tools. These new types of data can be represented and treated as constraints, allowing a set of data to be represented through equations, inequations, and Boolean combinations of both. To this end, constraint databases were defined and some prototypes were developed. Since several aspects can still be improved, we propose a new architecture called the labelled object-relational constraint database (LORCDB). It provides greater expressiveness, since the database is adapted to support more types of data, rather than the data having to be adapted to the database. In this paper, the projection operator of SQL is extended so that it works with linear and polynomial constraints and with the variables of constraints. To optimize query evaluation efficiency, several strategies and algorithms are used to obtain an efficient query plan. Most work on constraint databases uses spatiotemporal data as case studies; this paper instead uses model-based diagnosis, a promising research area that permits more complicated queries than spatiotemporal examples. Our architecture permits queries over constraints to be defined over different sets of variables by using symbolic substitution and elimination of variables.
    Ministerio de Ciencia y Tecnología DPI2006-15476-C02-0
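
    For the linear case, projecting a constraint relation onto a subset of its variables amounts to eliminating the remaining variables, e.g. via Fourier-Motzkin elimination. The sketch below is a minimal illustration of that idea, not the LORCDB implementation, which also handles polynomial constraints and applies cost-based query planning.

        # Minimal sketch: projection of linear constraints via Fourier-Motzkin
        # elimination. Constraints are (coeffs, bound) pairs meaning
        # sum(coeffs[v] * v) <= bound.
        from fractions import Fraction

        def eliminate(constraints, var):
            pos, neg, rest = [], [], []
            for coeffs, b in constraints:
                a = coeffs.get(var, 0)
                (pos if a > 0 else neg if a < 0 else rest).append((coeffs, b))
            out = list(rest)
            # Pair every upper bound on var with every lower bound; var cancels.
            for cp, bp in pos:
                for cn, bn in neg:
                    ap, an = Fraction(cp[var]), Fraction(-cn[var])
                    merged = {}
                    for v in set(cp) | set(cn):
                        c = an * cp.get(v, 0) + ap * cn.get(v, 0)
                        if v != var and c != 0:
                            merged[v] = c
                    out.append((merged, an * bp + ap * bn))
            return out

        def project(constraints, keep):
            for v in {v for c, _ in constraints for v in c} - set(keep):
                constraints = eliminate(constraints, v)
            return constraints

        # x + y <= 4, -x <= 0, x - y <= 1  -->  projected onto {y}: -1 <= y <= 4
        rel = [({"x": 1, "y": 1}, 4), ({"x": -1}, 0), ({"x": 1, "y": -1}, 1)]
        for coeffs, b in project(rel, keep=["y"]):
            print(coeffs, "<=", b)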

    On trip planning queries in spatial databases

    In this paper we discuss a new type of query in spatial databases, called the Trip Planning Query (TPQ). Given a set of points P in space, where each point belongs to a category, and given two points s and e, a TPQ asks for the best trip that starts at s, passes through exactly one point from each category, and ends at e. An example of a TPQ arises when a user wants to visit a set of different places while minimizing the total travelling cost, e.g., what is the shortest travelling plan that visits an automobile shop, a CVS pharmacy outlet, and a Best Buy shop along my trip from A to B? The trip planning query is an extension of the well-known TSP problem and is therefore NP-hard. The difficulty of this query lies in the existence of multiple choices for each category. In this paper, we first study fast approximation algorithms for the trip planning query in a metric space, assuming that the data set fits in main memory, and give a theoretical analysis of their approximation bounds. Then, the trip planning query is examined for data sets that do not fit in main memory and must be stored on disk. For disk-resident data, we consider two cases. In one case, we assume that the points are located in Euclidean space and indexed with an R-tree. In the other case, we consider points that lie on the edges of a spatial network (e.g. a road network), where the distance between two points is defined as the shortest distance over the network. Finally, we give an experimental evaluation of the proposed algorithms using synthetic data sets generated on real road networks.
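
    To give a flavour of the approximation algorithms studied, the following is a minimal sketch of a greedy nearest-neighbor heuristic for TPQ in Euclidean space: from the current location, repeatedly hop to the closest point whose category is still uncovered, then finish at e. The disk-resident R-tree and road-network variants are not shown, and the helper names are illustrative.

        # Toy sketch of a greedy nearest-neighbor heuristic for TPQ.
        import math

        def tpq_greedy(start, end, points):
            """points: list of ((x, y), category); visit one point per category."""
            dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
            remaining = {cat for _, cat in points}
            trip, cur = [start], start
            while remaining:
                # Nearest point among those covering a still-missing category.
                nxt, cat = min(
                    ((p, c) for p, c in points if c in remaining),
                    key=lambda pc: dist(cur, pc[0]),
                )
                trip.append(nxt)
                remaining.discard(cat)
                cur = nxt
            trip.append(end)
            cost = sum(dist(trip[i], trip[i + 1]) for i in range(len(trip) - 1))
            return trip, cost

        shops = [((1, 1), "pharmacy"), ((3, 0), "pharmacy"), ((2, 2), "electronics")]
        trip, cost = tpq_greedy((0, 0), (4, 0), shops)
        print(trip, round(cost, 2))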