ParetoPrep: Fast computation of Path Skyline Queries
Computing cost-optimal paths in network data is an important task in many application areas such as transportation networks, computer networks, or social graphs. In many cases, the cost of an edge is described by multiple cost criteria. For example, in a road network possible cost criteria are distance, time, ascent, energy consumption, or toll fees. In such a multicriteria network, a route or path skyline query computes the set of all paths having Pareto-optimal costs, i.e., each result path is optimal for a different set of user preferences. In this paper, we propose a new method for computing route skylines which significantly decreases processing time and memory consumption. Furthermore, our method does not rely on any precomputation or indexing and is thus suitable for dynamically changing edge costs. Our experiments demonstrate that our method outperforms state-of-the-art approaches and allows highly efficient path skyline computation without any preprocessing.
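
To make the query semantics concrete, here is a minimal Python sketch (ours, not ParetoPrep) of the Pareto-dominance relation and a naive skyline filter over candidate cost vectors; the paper's contribution is computing this set over paths far more efficiently, without preprocessing.

```python
def dominates(a, b):
    """Cost vector a dominates b if it is no worse in every criterion
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def path_skyline(costs):
    """Naive O(k^2) skyline filter, shown only to illustrate the result
    set a path skyline query returns."""
    return [c for c in costs if not any(dominates(d, c) for d in costs if d is not c)]

# e.g. (distance, time, toll) triples for candidate routes;
# (11, 35, 1) is dominated by (10, 30, 0) and is filtered out
print(path_skyline([(10, 30, 0), (12, 25, 0), (9, 40, 2), (11, 35, 1)]))
```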
Sequential random sampling revisited: hidden shuffle method
Random sampling (without replacement) is ubiquitously employed to obtain a representative subset of the data. Unlike common methods, sequential methods report samples in ascending order of index without keeping track of previous samples. This enables lightweight iterators that can jump directly from one sampled position to the next. Previously, sequential methods focused on drawing from the distribution of gap sizes, which requires intricate algorithms that are difficult to validate and can be slow in the worst case. This can be avoided by a new method, the Hidden Shuffle. The name mirrors the fact that although the algorithm does not resemble shuffling, its correctness can be proven by conceptualising the sampling process as a random shuffle. The Hidden Shuffle algorithm stores just a handful of values, can be implemented in a few lines of code, offers strong worst-case guarantees, and is shown to be faster than state-of-the-art methods while using comparably few random variates.
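
For contrast, the classic sequential baseline is Knuth's Algorithm S (selection sampling), sketched below; like the methods the abstract describes, it reports indices in ascending order without storing past samples, but it must scan all N positions rather than jumping from one sampled position to the next. This is explicitly not the Hidden Shuffle itself.

```python
import random

def selection_sample(n, N):
    """Classic sequential selection sampling (Knuth's Algorithm S):
    yields n indices from range(N) in strictly ascending order, keeping
    only a running count of how many samples were already emitted."""
    chosen = 0
    for t in range(N):
        # include index t with probability (still needed) / (still available)
        if (N - t) * random.random() < n - chosen:
            yield t
            chosen += 1
            if chosen == n:
                return

print(list(selection_sample(5, 100)))  # e.g. [3, 21, 44, 67, 91]
```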
Data-independent space partitionings for summaries
Histograms are a standard tool in data management for describing multidimensional data. It is often convenient or even necessary to define data-independent histograms, which partition space in advance without observing the data itself. Specific motivations arise when it is not suitable to frequently change the boundaries between histogram cells: for example, when the data is subject to many insertions and deletions, when data is distributed across multiple systems, or when producing a privacy-preserving representation of the data. The baseline approach is an equiwidth histogram, i.e., a regular grid over the space. However, this is not optimal for the objective of splitting the multidimensional space into (possibly overlapping) bins such that any query box can be rebuilt from a set of non-overlapping bins with minimal excess (or deficit) of volume. We therefore investigate how to split the space into bins and identify novel solutions that offer a good balance of desirable properties. As many data processing tools require a dataset as input, we also propose efficient methods for obtaining synthetic point sets that match the histograms over the overlapping bins.
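
A minimal sketch of the equiwidth baseline mentioned in the abstract, assuming a known fixed domain per dimension (the interface is our own, not the paper's): because the grid depends only on the domain, insertions and deletions touch a single cell count and the partition itself never moves.

```python
import numpy as np

def equiwidth_histogram(points, domain, g):
    """Data-independent equiwidth grid: g cells per dimension, with
    boundaries fixed by `domain` = [(lo, hi), ...] alone, never by the
    observed points."""
    pts = np.asarray(points, dtype=float)
    edges = [np.linspace(lo, hi, g + 1) for (lo, hi) in domain]
    counts, _ = np.histogramdd(pts, bins=edges)
    return edges, counts
```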
Frequency-constrained substring complexity
We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] into intervals I_1 = [l_1, u_1], ..., I_k = [l_k, u_k], the frequency-constrained substring complexity of x is the function that maps (i, j) to the number of distinct substrings of length i of x occurring at least l_j and at most u_j times in x. We extend this notion as follows. For a string x, a dictionary D of d strings (documents), and a partition of [d] into intervals I_1 = [l_1, u_1], ..., I_k = [l_k, u_k], we define a 2D array S as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least l_j and at most u_j documents. Array S can thus be seen as the distribution of the substring complexity of x into document frequency classes. We show that after a linear-time preprocessing of D, for any x and any partition of [d] into intervals given online, array S can be computed in near-optimal time.
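
The single-string definition can be checked with a brute-force computation; the sketch below (ours, far from the paper's near-optimal algorithm) enumerates all substrings and buckets them by length and frequency class.

```python
from collections import Counter

def substring_complexity(x, intervals):
    """Brute-force illustration of the definition: the entry (i, j) is
    the number of distinct substrings of length i occurring between
    lo_j and hi_j times in x, for intervals = [(lo_1, hi_1), ...]."""
    n = len(x)
    S = {}
    for i in range(1, n + 1):
        counts = Counter(x[s:s + i] for s in range(n - i + 1))
        for j, (lo, hi) in enumerate(intervals):
            S[(i, j)] = sum(1 for c in counts.values() if lo <= c <= hi)
    return S

# length-2 substrings of "abab": "ab" occurs twice, "ba" once
print(substring_complexity("abab", [(1, 1), (2, 4)]))
```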
PGMJoins: random join sampling with graphical models
Modern databases face formidable challenges when called to join (several) massive tables. Joins (especially many-to-many joins) are very time- and resource-consuming, join results can be too big to keep in memory, and performing analytics/learning tasks over them costs dearly in terms of time, resources, and money (in the cloud). Moreover, although random sampling is a promising idea to mitigate the above problems, the current state of the art leaves much room for improvement. With this paper we contribute a principled solution, coined PGMJoins. PGMJoins adapts Probabilistic Graphical Models (PGMs) to derive provably random samples of the join result for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins. PGMJoins contributes optimizations both for deriving the structure of the graph and for PGM inference. It also contributes a novel Sum-Product Message Passing Algorithm (SP-MPA) to draw uniform samples of the joint distribution (the join result) efficiently, and a novel way to deal with cyclic joins. Despite the use of PGMs, the learned joint distribution is not approximated, and the uniform samples are drawn from the true distribution. Our experimentation using queries and datasets from TPC-H, JOB, TPC-DS, and Twitter shows PGMJoins to outperform the state of the art (by 2X-28X).
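
As a hedged illustration of why factoring the join distribution enables uniform sampling without materialization, here is a two-table sketch of our own (not PGMJoins, which generalizes this to n-way, many-to-many, and cyclic joins via PGM inference): weighting each left-hand row by its number of join partners makes every joined pair equally likely.

```python
import random
from collections import defaultdict

def sample_join(R, S, key_r, key_s, k):
    """Draw k uniform samples from R JOIN S without materializing it,
    by factoring the join as p(r) * p(s | join key of r)."""
    partners = defaultdict(list)
    for s in S:
        partners[key_s(s)].append(s)
    # weight each r by its partner count so every (r, s) pair in the
    # join result has probability 1 / |join result|
    weights = [len(partners[key_r(r)]) for r in R]
    out = []
    for _ in range(k):
        r = random.choices(R, weights=weights)[0]
        out.append((r, random.choice(partners[key_r(r)])))
    return out

# usage: sample_join(R, S, key_r=lambda r: r[0], key_s=lambda s: s[0], k=10)
```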
Approximating multidimensional range counts with maximum error guarantees
We address the problem of compactly approximating multidimensional range counts with a guaranteed maximum error and propose a novel histogram-based summary structure, termed SliceHist. The key idea is to operate a grid histogram in an approximately rank-transformed space, where the data points are more uniformly distributed and each grid slice contains only a small number of points. The points of each slice are then summarised again using the same technique. As each query box partially intersects only a few slices and each grid slice has few data points, the summary is able to achieve tight error guarantees. In experiments and through analysis of non-asymptotic formulas, we show that SliceHist is not only competitive with existing heuristics in terms of performance, but additionally offers tight error guarantees. Presented at the 2021 IEEE 37th International Conference on Data Engineering (ICDE).
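
As we read the abstract, the first step admits a short sketch (an assumption on our part, not the paper's exact construction): approximate the rank transform with per-dimension empirical quantiles, then count points in the resulting grid; SliceHist then recursively summarises each slice the same way.

```python
import numpy as np

def slice_grid(points, g):
    """Approximate rank transform via per-dimension quantiles: the
    quantile edges make a regular grid whose slices hold roughly equal
    point counts. Assumes coordinates are distinct enough that the
    quantile edges are strictly increasing."""
    pts = np.asarray(points, dtype=float)
    edges = [np.quantile(pts[:, j], np.linspace(0, 1, g + 1))
             for j in range(pts.shape[1])]
    counts, _ = np.histogramdd(pts, bins=edges)
    return edges, counts
```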