
    ParetoPrep: Fast computation of Path Skyline Queries

    Computing cost-optimal paths in network data is an important task in many application areas, such as transportation networks, computer networks, or social graphs. In many cases, the cost of an edge is described by several cost criteria; for example, in a road network possible criteria are distance, time, ascent, energy consumption, or toll fees. In such a multicriteria network, a route or path skyline query computes the set of all paths having Pareto-optimal costs, i.e., each result path is optimal for some user preference. In this paper, we propose a new method for computing route skylines which significantly decreases processing time and memory consumption. Furthermore, our method does not rely on any precomputation or indexing and is thus suitable for dynamically changing edge costs. Our experiments demonstrate that our method outperforms state-of-the-art approaches and allows highly efficient path skyline computation without any preprocessing. Comment: 12 pages, 9 figures, technical report.
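
    The paper's contribution is the ParetoPrep search strategy itself; as a point of reference, the following minimal Python sketch (the names and the naive quadratic filter are illustrative, not from the paper) spells out what Pareto-optimal costs mean for a set of candidate paths:

```python
# Illustrative sketch, not the ParetoPrep algorithm: Pareto dominance over
# multicriteria path costs, with a naive O(n^2) skyline filter.
# Cost vectors are tuples such as (distance, time, toll).

def dominates(a, b):
    """True if cost vector a dominates b: a is no worse in every
    criterion and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(paths):
    """Keep only paths whose cost vectors no other path dominates."""
    return [p for p in paths
            if not any(dominates(q["cost"], p["cost"])
                       for q in paths if q is not p)]

candidates = [
    {"route": "A-B-D", "cost": (10.0, 12.0, 0.0)},
    {"route": "A-C-D", "cost": (12.0, 9.0, 2.5)},
    {"route": "A-D",   "cost": (11.0, 13.0, 3.0)},  # dominated by A-B-D
]
print([p["route"] for p in skyline(candidates)])  # ['A-B-D', 'A-C-D']
```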

    Sequential random sampling revisited: hidden shuffle method

    Random sampling (without replacement) is ubiquitously employed to obtain a representative subset of the data. Unlike common methods, sequential methods report samples in ascending order of index without keeping track of previous samples. This enables lightweight iterators that can jump directly from one sampled position to the next. Previously, sequential methods focused on drawing from the distribution of gap sizes, which requires intricate algorithms that are difficult to validate and can be slow in the worst case. This can be avoided by a new method, the Hidden Shuffle. The name mirrors the fact that although the algorithm does not resemble shuffling, its correctness can be proven by conceptualising the sampling process as a random shuffle. The Hidden Shuffle algorithm stores just a handful of values, can be implemented in a few lines of code, offers strong worst-case guarantees, and is shown to be faster than state-of-the-art methods while using comparably few random variates.
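
    For context, the sketch below is not the Hidden Shuffle but the classic O(N) selection-sampling baseline it improves upon; it illustrates the sequential interface, reporting a uniform random subset in ascending index order without remembering earlier samples (names are illustrative):

```python
import random

def sequential_sample(n, N, rng=random.random):
    """Classic selection sampling: yields a uniform random n-subset of
    range(N) in ascending order, one index at a time (n <= N assumed).
    This is the simple O(N) baseline; gap-skipping methods and the
    Hidden Shuffle achieve the same interface in O(n) expected time."""
    needed = n
    for i in range(N):
        if needed == 0:
            break
        # Include position i with probability needed / (N - i).
        if rng() * (N - i) < needed:
            yield i
            needed -= 1

print(list(sequential_sample(5, 100)))  # e.g. [3, 27, 41, 78, 90]
```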

    Data-independent space partitionings for summaries

    Histograms are a standard tool in data management for describing multidimensional data. It is often convenient or even necessary to define data-independent histograms, which partition space in advance without observing the data itself. Specific motivations arise when it is not suitable to frequently change the boundaries between histogram cells: for example, when the data is subject to many insertions and deletions, when data is distributed across multiple systems, or when producing a privacy-preserving representation of the data. The baseline approach is an equi-width histogram, i.e., a regular grid over the space. However, this is not optimal for the objective of splitting the multidimensional space into (possibly overlapping) bins such that each box can be rebuilt from a set of non-overlapping bins with minimal excess (or deficit) of volume. We therefore investigate how to split the space into bins and identify novel solutions that offer a good balance of desirable properties. As many data processing tools require a dataset as input, we also propose efficient methods for obtaining synthetic point sets that match the histograms over the overlapping bins.
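
    As an illustration of the equi-width baseline discussed above (not the paper's novel partitionings), a regular grid can be fixed before any data arrives; the names and parameters here are invented for the example:

```python
from collections import Counter

def equiwidth_histogram(points, k, lo, hi):
    """Data-independent baseline: a regular k-per-dimension grid over the
    box [lo, hi], with cell boundaries fixed before seeing the data.
    Returns a Counter mapping cell index tuples to point counts."""
    d = len(lo)
    counts = Counter()
    for p in points:
        cell = tuple(
            min(k - 1, int((p[j] - lo[j]) / (hi[j] - lo[j]) * k))
            for j in range(d)
        )
        counts[cell] += 1
    return counts

pts = [(0.1, 0.9), (0.2, 0.8), (0.7, 0.3)]
print(equiwidth_histogram(pts, k=4, lo=(0.0, 0.0), hi=(1.0, 1.0)))
# Counter({(0, 3): 2, (2, 1): 1})
```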

    Frequency-constrained substring complexity

    We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string $x$ of length $n$ and a partition of $[n]$ into $\tau$ intervals $\mathcal{I}=I_1,\ldots,I_\tau$, the frequency-constrained substring complexity of $x$ is the function $f_{x,\mathcal{I}}(i,j)$ that maps $i,j$ to the number of distinct substrings of length $i$ of $x$ occurring at least $\alpha_j$ and at most $\beta_j$ times in $x$, where $I_j=[\alpha_j,\beta_j]$. We extend this notion as follows. For a string $x$, a dictionary $\mathcal{D}$ of $d$ strings (documents), and a partition of $[d]$ into $\tau$ intervals $I_1,\ldots,I_\tau$, we define a 2D array $S=S[1\mathinner{.\,.}|x|,1\mathinner{.\,.}\tau]$ as follows: $S[i,j]$ is the number of distinct substrings of length $i$ of $x$ occurring in at least $\alpha_j$ and at most $\beta_j$ documents, where $I_j=[\alpha_j,\beta_j]$. Array $S$ can thus be seen as the distribution of the substring complexity of $x$ into $\tau$ document frequency classes. We show that after a linear-time preprocessing of $\mathcal{D}$, for any $x$ and any partition of $[d]$ into $\tau$ intervals given online, array $S$ can be computed in near-optimal $\mathcal{O}(|x|\,\tau \log\log d)$ time.
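
    The single-string definition can be spelled out directly; the sketch below (a brute-force illustration, far from the paper's near-optimal algorithm) tabulates $f_{x,\mathcal{I}}(i,j)$ by counting every substring occurrence explicitly:

```python
from collections import Counter

def substring_complexity(x, intervals):
    """Brute-force frequency-constrained substring complexity for a
    single string: S[i][j] is the number of distinct substrings of
    length i of x whose occurrence count in x lies in the j-th
    interval [alpha_j, beta_j]. (The paper handles the document
    variant in near-optimal time after linear preprocessing.)"""
    n = len(x)
    S = [[0] * len(intervals) for _ in range(n + 1)]
    for i in range(1, n + 1):
        freq = Counter(x[s:s + i] for s in range(n - i + 1))
        for j, (alpha, beta) in enumerate(intervals):
            S[i][j] = sum(1 for c in freq.values() if alpha <= c <= beta)
    return S

S = substring_complexity("abaab", [(1, 1), (2, 5)])
print(S[2])  # [2, 1]: 'ba' and 'aa' occur once; 'ab' occurs twice
```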

    PGMJoins: random join sampling with graphical models

    Modern databases face formidable challenges when called to join (several) massive tables. Joins (especially many-to-many joins) are very time- and resource-consuming, join results can be too big to keep in memory, and performing analytics/learning tasks over them costs dearly in terms of time, resources, and money (in the cloud). Although random sampling is a promising idea for mitigating these problems, the current state of the art leaves much room for improvement. With this paper we contribute a principled solution, coined PGMJoins. PGMJoins adapts Probabilistic Graphical Models to derive provably random samples of the join result for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins. PGMJoins contributes optimizations both for deriving the structure of the graph and for PGM inference. It also contributes a novel Sum-Product Message Passing Algorithm (SP-MPA) to efficiently draw a uniform sample of the joint distribution (the join result) and a novel way to deal with cyclic joins. Despite the use of PGMs, the learned joint distribution is not approximated, and the uniform samples are drawn from the true distribution. Our experimentation with queries and datasets from TPC-H, JOB, TPC-DS, and Twitter shows PGMJoins to outperform the state of the art (by 2X-28X).
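
    PGMJoins itself is beyond a short sketch, but the well-known two-table baseline conveys what a uniform sample of a join result means: weight each left tuple by its fan-out on the right, then pick a matching right tuple uniformly, so every join tuple is equally likely. The code below is that textbook baseline, not PGMJoins; all names are illustrative:

```python
import random
from collections import defaultdict

def uniform_join_sample(R, S, key_r, key_s, k):
    """Baseline (not PGMJoins): draw k uniform samples from the join of
    R and S on key_r = key_s. Each R-tuple is weighted by its fan-out
    in S, then a matching S-tuple is picked uniformly, so every join
    result tuple has probability 1 / |join|."""
    index = defaultdict(list)
    for s in S:
        index[s[key_s]].append(s)
    weighted = [(r, len(index[r[key_r]])) for r in R if index[r[key_r]]]
    rs, ws = zip(*weighted)
    samples = []
    for _ in range(k):
        r = random.choices(rs, weights=ws)[0]
        s = random.choice(index[r[key_r]])
        samples.append((r, s))
    return samples

R = [{"id": 1}, {"id": 2}]
S = [{"rid": 1, "v": "a"}, {"rid": 1, "v": "b"}, {"rid": 2, "v": "c"}]
print(uniform_join_sample(R, S, "id", "rid", 3))
```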

    Approximating multidimensional range counts with maximum error guarantees

    We address the problem of compactly approximating multidimensional range counts with a guaranteed maximum error and propose a novel histogram-based summary structure, termed SliceHist. The key idea is to operate a grid histogram in an approximately rank-transformed space, where the data points are more uniformly distributed and each grid slice contains only a small number of points. The points of each slice are then summarised again using the same technique. As each query box partially intersects only a few slices and each grid slice holds few data points, the summary is able to achieve tight error guarantees. In experiments and through analysis of non-asymptotic formulas we show that SliceHist is not only competitive with existing heuristics in terms of performance, but additionally offers tight error guarantees. Presented at: 2021 IEEE 37th International Conference on Data Engineering.
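
    As a rough, assumption-laden sketch of the rank-transform idea (simplified to a single grid level, unlike SliceHist's recursive slice summaries), each coordinate can be mapped through its empirical CDF before gridding, so skewed data spreads out roughly uniformly; all names here are invented:

```python
import bisect

def rank_transform_grid(points, k):
    """Toy illustration of rank-transformed gridding (not the published
    SliceHist structure): map each coordinate through its empirical CDF
    so points spread roughly uniformly over [0, 1) per dimension, then
    count them in a regular k-per-dimension grid."""
    d = len(points[0])
    n = len(points)
    sorted_coords = [sorted(p[j] for p in points) for j in range(d)]
    counts = {}
    for p in points:
        cell = tuple(
            min(k - 1, bisect.bisect_left(sorted_coords[j], p[j]) * k // n)
            for j in range(d)
        )
        counts[cell] = counts.get(cell, 0) + 1
    return counts

pts = [(1.0, 100.0), (2.0, 10.0), (50.0, 5.0), (51.0, 1.0)]
print(rank_transform_grid(pts, k=2))  # {(0, 1): 2, (1, 0): 2}
```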