8 research outputs found

    Parallel processing can be harmful: The unusual behavior of interpolation search

    Abstract: Several articles have noted the usefulness of a retrieval algorithm called sequential interpolation search, and Yao and Yao have proven a lower bound of log log N − O(1), showing this algorithm is actually optimal up to an additive constant on unindexed files of size N generated by the uniform probability distribution. We generalize the latter to show that log log N − log log P − O(1) lower bounds the complexity of any retrieval algorithm with P parallel processors for searching an unindexed file of size N. This result is surprising because we also show how to obtain an upper bound that matches the lower bound up to an additive constant with a procedure that actually uses no parallel processing outside its last iteration (at which time our proposal turns on P processors in parallel). Our first theorem therefore states that parallel processing before the literally last iteration in the search of an unindexed ordered file has nearly no usefulness. Two further surprising facts are that the preceding result holds even when communication between the parallel processing units involves no delay and that the parallel algorithms are actually inherently slower than their sequential counterparts when each invocation of the SIMD machine invokes a communication step with any type of nonzero delay. The presentation in the first two chapters of this paper is quite informal, so that the reader can quickly grasp the underlying intuition.
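
    For readers unfamiliar with the sequential algorithm the abstract refers to, the following is a minimal sketch of textbook interpolation search, assuming a sorted list of integer keys. The function name and the sample data are illustrative and not taken from the paper; the parallel variant discussed in the abstract is not shown.

```python
def interpolation_search(keys, target):
    """Return an index i with keys[i] == target, or -1 if target is absent.

    Assumes `keys` is a list of integers sorted in ascending order.
    """
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[lo] == keys[hi]:
            # All remaining keys are equal, and target lies between them.
            return lo
        # Guess the position by linear interpolation between the endpoints.
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

    On uniformly distributed keys each probe is expected to shrink the search range sharply, which is the setting in which the log log N bound quoted above applies.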

    Random input helps searching predecessors

    A data structure problem consists of the finite sets: D of data, Q of queries, A of query answers, associated with a function f: D × Q → A. The data structure of file X is "static" ("dynamic") if we "do not" ("do") require quick updates as X changes. An important goal is to compactly encode a file X ∈ D, such that for each query y ∈ Q, function f(X, y) requires the minimum time to compute an answer in A. This goal is trivial if the size of D is large, since for each query y ∈ Q, it was shown that f(X, y) requires O(1) time for the most important queries in the literature. Hence, this goal becomes interesting to study as a trade-off between the "storage space" and the "query time", both measured as functions of the file size n = |X|. The ideal solution would be to use linear O(n) = O(|X|) space, while retaining a constant O(1) query time. However, if f(X, y) computes the static predecessor search (find largest x ∈ X: x ≤ y), then Ajtai [Ajt88] proved a negative result: using just n^{O(1)} = |X|^{O(1)} data space, it is not possible to evaluate f(X, y) in O(1) time ∀y ∈ Q. The proof exhibited a bad distribution of data D, such that ∃y* ∈ Q (a "difficult" query y*) for which f(X, y*) requires ω(1) time. Essentially [Ajt88] is an existential result, resolving the worst-case scenario. But [Ajt88] left open the question: do we typically, that is, with high probability (w.h.p.), encounter such "difficult" queries y ∈ Q, when assuming reasonable distributions with respect to (w.r.t.) queries and data? Below we make reasonable assumptions w.r.t. the distribution of the queries y ∈ Q, as well as w.r.t. the distribution of data X ∈ D. In two interesting scenarios studied in the literature, we resolve the typical (w.h.p.) query time.
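
    To make the query concrete, here is a minimal sketch of static predecessor search (find the largest x ∈ X with x ≤ y) over a sorted list, using binary search from Python's standard `bisect` module. This is only the O(n)-space, O(log n)-time baseline, not the data structure studied in the work above; the function name is hypothetical.

```python
from bisect import bisect_right


def predecessor(sorted_keys, y):
    """Return the largest x in sorted_keys with x <= y, or None if none exists.

    Assumes `sorted_keys` is sorted in ascending order.
    """
    # bisect_right returns the number of keys that are <= y.
    i = bisect_right(sorted_keys, y)
    return sorted_keys[i - 1] if i > 0 else None
```

    For example, `predecessor([1, 4, 9, 16], 10)` returns `9`, and a query below the smallest key returns `None`.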

    Suffix Arrays with a Twist

    The suffix array is a classic full-text index, combining effectiveness with simplicity. We discuss three approaches aiming to improve its efficiency even more: changes to the navigation, changes to the data layout, and the addition of extra data. In short, we show that i) the way we search for the right interval boundary significantly impacts the overall search speed, ii) a B-tree data layout easily wins over the standard one, iii) the well-known idea of a lookup table for the prefixes of the suffixes can be refined using compression, iv) caching prefixes of the suffixes in a helper array can pose another practical space-time tradeoff.
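
    As background for the "interval boundary" mentioned above, the following sketch shows the standard suffix array search: two binary searches locate the left and right boundaries of the block of suffixes that start with the pattern. The naive construction and both function names are assumptions made for illustration; none of the refinements listed in the abstract (B-tree layout, lookup tables, prefix caching) are implemented here.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_interval(text, sa, pattern):
    """Return (left, right) such that sa[left:right] lists every
    position where `pattern` occurs in `text`."""
    m = len(pattern)
    # Left boundary: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Right boundary: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo
```

    The second loop starts from the left boundary, so the right boundary search only scans the remaining part of the array.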

    Efficient Algorithms for Similarity and Skyline Summary on Multidimensional Datasets.

    Efficient management of large multidimensional datasets has attracted much attention in the database research community. Such large multidimensional datasets are common, and efficient algorithms are needed for analyzing these datasets for a variety of applications. In this thesis, we focus our study on two very common classes of analysis: similarity and skyline summarization. We first focus on similarity when one of the dimensions in the multidimensional dataset is temporal. We then develop algorithms for evaluating skyline summaries effectively for both temporal and low-cardinality attribute domain datasets and propose different methods for improving the effectiveness of the skyline summary operation. This thesis begins by studying similarity measures for time-series datasets and efficient algorithms for time-series similarity evaluation. The first contribution of this thesis is a new algorithm which can be used to evaluate similarity methods whose matching criterion is bounded by a specified threshold value. The second contribution of this thesis is the development of a new time-interval skyline operator, which continuously computes the current skyline over a data stream. We present a new algorithm called LookOut for evaluating such queries efficiently, and empirically demonstrate the scalability of this algorithm. Current skyline evaluation techniques follow a common paradigm that eliminates data elements from skyline consideration by finding other elements in the dataset that dominate them. The performance of such techniques is heavily influenced by the underlying data distribution. The third contribution of this thesis is a novel technique called the Lattice Skyline Algorithm (LS) that is built around a new paradigm for skyline evaluation on datasets with attributes that are drawn from low-cardinality domains. The utility of the skyline as a data summarization technique is often diminished by the volume of points in the skyline. The final contribution of this thesis is a novel scheme which remedies the skyline volume problem by ranking the elements of the skyline based on their importance to the skyline summary. Collectively, the techniques described in this thesis present efficient methods for two common and computationally intensive analysis operations on large multidimensional datasets.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/57643/2/mmorse_1.pd
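
    The "common paradigm" described in the abstract, eliminating elements that are dominated by others, can be sketched as a quadratic-time filter. This is a generic illustration, not LookOut or the Lattice Skyline Algorithm; it assumes points are equal-length tuples and that smaller values are preferred in every dimension.

```python
def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def skyline(points):
    """Return the points that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

    For example, among `[(1, 9), (3, 3), (4, 4), (9, 1), (5, 5)]` the points `(4, 4)` and `(5, 5)` are dominated by `(3, 3)`, so the skyline is `[(1, 9), (3, 3), (9, 1)]`.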

    References, Appendices & All Parts Merged

    Includes: Appendix MA: Selected Mathematical Formulas; Appendix CA: Selected Physical Constants; References; EGP merged file (all parts, appendices, and references)
    https://commons.library.stonybrook.edu/egp/1007/thumbnail.jp