32 research outputs found

    Linear Sketches for Approximate Aggregate Range Queries

    Answering aggregate queries approximately over multidimensional data is an important problem that arises naturally in many applications. One approach is to maintain a succinct (i.e., O(k)-space) representation, called a sketch, of the frequency distribution h of the data, and to use this sketch ĥ for answering queries. Common sketches are constructed via linear mappings of h onto a k-dimensional space, e.g., by mapping h to its top-k Fourier/wavelet coefficients. We call such sketches linear sketches, since ĥ = P · h for some sketching matrix P. Linear sketches have the benefit that they can be easily maintained incrementally over data streams. Sketches are typically optimized for approximating the data distribution, but not the answers to queries. In this paper, we are concerned with linear sketches that approximate well not only the data but also the answers to aggregate queries. The quality of the approximations is measured using the mean squared error (MSE) and the relative error (RLE). A query is represented by a column vector q such that its answer is q^T h. A given set of queries can be represented by an appropriate query matrix Q. We show that the MSE for the queries is minimized when the sketching matrix used to construct a linear sketch of h has as columns the top-k eigenvectors of the query matrix Q. Further, if the quer
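The incremental-maintenance property mentioned in the abstract follows directly from linearity: when one entry of h changes by some amount, the sketch changes by that amount times the matching column of P. The following is a minimal sketch of that idea only, not the paper's construction. The random Gaussian matrix, the function names, and the toy stream are all assumptions chosen for illustration; the abstract itself speaks of Fourier/wavelet coefficients and eigenvector-based matrices.

```python
import random


def make_sketch_matrix(k, n, seed=0):
    """Return a k x n matrix P (hypothetical choice: random Gaussian entries)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]


def sketch(P, h):
    """Batch computation of the linear sketch: one dot product per row of P."""
    return [sum(p_ij * h_j for p_ij, h_j in zip(row, h)) for row in P]


def update_sketch(P, h_hat, index, delta):
    """Apply the stream update h[index] += delta to the sketch in O(k) time."""
    for r, row in enumerate(P):
        h_hat[r] += delta * row[index]


n, k = 8, 3                      # domain size and sketch size (toy values)
P = make_sketch_matrix(k, n)
h = [0.0] * n                    # full frequency distribution, kept only for checking
h_hat = [0.0] * k                # the sketch, which is all a stream processor would store

# Assumed toy stream of (index, delta) updates.
stream = [(2, 1.0), (5, 3.0), (2, 2.0), (7, -1.0)]
for index, delta in stream:
    h[index] += delta
    update_sketch(P, h_hat, index, delta)

batch = sketch(P, h)             # recompute from scratch for comparison
```

After the stream is consumed, `h_hat` and `batch` agree up to floating-point rounding, which is the sense in which a linear sketch can be maintained without ever storing h.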

    Distance Measures for Effective Clustering of ARIMA Time-Series

    Many environmental and socioeconomic time series can be adequately modeled using Auto-Regressive Integrated Moving Average (ARIMA) models. We call such time series ARIMA time series. We consider the problem of clustering ARIMA time series. We propose using the Linear Predictive Coding (LPC) cepstrum of a time series for clustering ARIMA time series, taking the Euclidean distance between the LPC cepstra of two time series as their dissimilarity measure. We demonstrate that LPC cepstral coefficients have the desired features for accurate clustering and efficient indexing of ARIMA time series. For example, a few LPC cepstral coefficients are sufficient to discriminate between time series that are modeled by different ARIMA models. In fact, this approach requires fewer coefficients than traditional approaches, such as DFT and DWT. The proposed distance measure can be used for measuring the similarity between different ARIMA models as well
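The dissimilarity measure described in the abstract can be illustrated with a small example. This is a hedged sketch, not the authors' code: it assumes the autoregressive coefficients a_1..a_p of each series have already been estimated elsewhere, with the sign convention x_t = a_1 x_{t-1} + ... + a_p x_{t-p} + e_t, and it uses the standard recursion that converts such coefficients into LPC cepstral coefficients. The function names and the example coefficient values are hypothetical.

```python
import math


def lpc_cepstrum(ar_coeffs, n_coeffs):
    """First n_coeffs LPC cepstral coefficients of an AR(p) model.

    Assumes the convention x_t = sum_m a_m * x_{t-m} + e_t, so that
    c_n = a_n + sum_{m=1}^{min(p, n-1)} (1 - m/n) * a_m * c_{n-m},
    with a_n taken as 0 when n > p.
    """
    p = len(ar_coeffs)
    c = []
    for n in range(1, n_coeffs + 1):
        value = ar_coeffs[n - 1] if n <= p else 0.0
        for m in range(1, min(p, n - 1) + 1):
            # c[n - m - 1] holds c_{n-m} because the list is 0-indexed.
            value += (1.0 - m / n) * ar_coeffs[m - 1] * c[n - m - 1]
        c.append(value)
    return c


def cepstral_distance(ar_a, ar_b, n_coeffs=5):
    """Euclidean distance between the LPC cepstra of two AR models."""
    ca = lpc_cepstrum(ar_a, n_coeffs)
    cb = lpc_cepstrum(ar_b, n_coeffs)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


# Hypothetical AR coefficients for three series.
series_a = [0.5]          # AR(1)
series_b = [0.5]          # same model as series_a
series_c = [0.3, -0.2]    # AR(2)

cep_a = lpc_cepstrum(series_a, 3)
d_same = cepstral_distance(series_a, series_b)
d_diff = cepstral_distance(series_a, series_c)
```

For an AR(1) model with coefficient a, the recursion reduces to c_n = a^n / n, which gives a quick sanity check on the implementation. Series generated by the same model get distance zero, while different models are separated even when only a few coefficients are compared.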