ParetoPrep: Fast computation of Path Skyline Queries
Computing cost-optimal paths in network data is an important task in many application areas such as transportation networks, computer networks, or social graphs. In many cases, the cost of an edge is described by multiple cost criteria. For example, in a road network possible cost criteria are distance, time, ascent, energy consumption, or toll fees. In such a multicriteria network, a route or path skyline query computes the set of all paths having Pareto-optimal costs, i.e., each result path is optimal for a different set of user preferences. In this paper, we propose a new method for computing route skylines which significantly decreases processing time and memory consumption. Furthermore, our method does not rely on any precomputation or indexing and is thus suitable for dynamically changing edge costs. Our experiments demonstrate that our method outperforms state-of-the-art approaches and allows highly efficient path skyline computation without any preprocessing.
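
To make the query semantics concrete, here is a minimal Python sketch (ours, not ParetoPrep) of the Pareto-dominance relation and a naive skyline filter over candidate cost vectors; the paper's contribution is computing this set over paths far more efficiently, without preprocessing.

```python
def dominates(a, b):
    """Cost vector a dominates b if it is no worse in every criterion
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def path_skyline(costs):
    """Naive O(k^2) skyline filter, shown only to illustrate the result
    set a path skyline query returns."""
    return [c for c in costs if not any(dominates(d, c) for d in costs if d is not c)]

# e.g. (distance, time, toll) triples for candidate routes;
# (11, 35, 1) is dominated by (10, 30, 0) and is filtered out
print(path_skyline([(10, 30, 0), (12, 25, 0), (9, 40, 2), (11, 35, 1)]))
```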
Sequential random sampling revisited: hidden shuffle method
Random sampling (without replacement) is ubiquitously employed to obtain a representative subset of the data. Unlike common methods, sequential methods report samples in ascending order of index without keeping track of previous samples. This enables lightweight iterators that can jump directly from one sampled position to the next. Previously, sequential methods focused on drawing from the distribution of gap sizes, which requires intricate algorithms that are difficult to validate and can be slow in the worst case. This can be avoided by a new method, the Hidden Shuffle. The name mirrors the fact that although the algorithm does not resemble shuffling, its correctness can be proven by conceptualising the sampling process as a random shuffle. The Hidden Shuffle algorithm stores just a handful of values, can be implemented in a few lines of code, offers strong worst-case guarantees, and is shown to be faster than state-of-the-art methods while using comparably few random variates.
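
For contrast, the classic sequential baseline is Knuth's Algorithm S (selection sampling), sketched below; like the methods the abstract describes, it reports indices in ascending order without storing past samples, but it must scan all N positions rather than jumping from one sampled position to the next. This is explicitly not the Hidden Shuffle itself.

```python
import random

def selection_sample(n, N):
    """Classic sequential selection sampling (Knuth's Algorithm S):
    yields n indices from range(N) in strictly ascending order, keeping
    only a running count of how many samples were already emitted."""
    chosen = 0
    for t in range(N):
        # include index t with probability (still needed) / (still available)
        if (N - t) * random.random() < n - chosen:
            yield t
            chosen += 1
            if chosen == n:
                return

print(list(selection_sample(5, 100)))  # e.g. [3, 21, 44, 67, 91]
```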
Data-independent space partitionings for summaries
Histograms are a standard tool in data management for describing multidimensional data. It is often convenient or even necessary to define data-independent histograms, which partition space in advance without observing the data itself. Specific motivations arise when it is not suitable to frequently change the boundaries between histogram cells: for example, when the data is subject to many insertions and deletions, when data is distributed across multiple systems, or when producing a privacy-preserving representation of the data. The baseline approach is an equiwidth histogram, i.e., a regular grid over the space. However, this is not optimal for the objective of splitting the multidimensional space into (possibly overlapping) bins such that any query box can be rebuilt from a set of non-overlapping bins with minimal excess (or deficit) of volume. We therefore investigate how to split the space into bins and identify novel solutions that offer a good balance of desirable properties. As many data processing tools require a dataset as input, we also propose efficient methods for obtaining synthetic point sets that match the histograms over the overlapping bins.
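
A minimal sketch of the equiwidth baseline mentioned in the abstract, assuming a known fixed domain per dimension (the interface is our own, not the paper's): because the grid depends only on the domain, insertions and deletions touch a single cell count and the partition itself never moves.

```python
import numpy as np

def equiwidth_histogram(points, domain, g):
    """Data-independent equiwidth grid: g cells per dimension, with
    boundaries fixed by `domain` = [(lo, hi), ...] alone, never by the
    observed points."""
    pts = np.asarray(points, dtype=float)
    edges = [np.linspace(lo, hi, g + 1) for (lo, hi) in domain]
    counts, _ = np.histogramdd(pts, bins=edges)
    return edges, counts
```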
Frequency-constrained substring complexity
We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] into intervals I_1 = [l_1, u_1], ..., I_k = [l_k, u_k], the frequency-constrained substring complexity of x is the function that maps (i, j) to the number of distinct substrings of length i of x occurring at least l_j and at most u_j times in x. We extend this notion as follows. For a string x, a dictionary D of d strings (documents), and a partition of [d] into intervals I_1 = [l_1, u_1], ..., I_k = [l_k, u_k], we define a 2D array S as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least l_j and at most u_j documents. Array S can thus be seen as the distribution of the substring complexity of x into document frequency classes. We show that after a linear-time preprocessing of D, for any x and any partition of [d] into intervals given online, array S can be computed in near-optimal time.
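
The single-string definition can be checked with a brute-force computation; the sketch below (ours, far from the paper's near-optimal algorithm) enumerates all substrings and buckets them by length and frequency class.

```python
from collections import Counter

def substring_complexity(x, intervals):
    """Brute-force illustration of the definition: the entry (i, j) is
    the number of distinct substrings of length i occurring between
    lo_j and hi_j times in x, for intervals = [(lo_1, hi_1), ...]."""
    n = len(x)
    S = {}
    for i in range(1, n + 1):
        counts = Counter(x[s:s + i] for s in range(n - i + 1))
        for j, (lo, hi) in enumerate(intervals):
            S[(i, j)] = sum(1 for c in counts.values() if lo <= c <= hi)
    return S

# length-2 substrings of "abab": "ab" occurs twice, "ba" once
print(substring_complexity("abab", [(1, 1), (2, 4)]))
```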
PGMJoins: random join sampling with graphical models
Modern databases face formidable challenges when called to join (several) massive tables. Joins (especially many-to-many joins) are very time- and resource-consuming, join results can be too big to keep in memory, and performing analytics/learning tasks over them costs dearly in terms of time, resources, and money (in the cloud). Moreover, although random sampling is a promising idea to mitigate the above problems, the current state of the art leaves much room for improvement. With this paper we contribute a principled solution, coined PGMJoins. PGMJoins adapts Probabilistic Graphical Models (PGMs) to derive provably random samples of the join result for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins. PGMJoins contributes optimizations both for deriving the structure of the graph and for PGM inference. It also contributes a novel Sum-Product Message Passing Algorithm (SP-MPA) to draw uniform samples of the joint distribution (the join result) efficiently, and a novel way to deal with cyclic joins. Despite the use of PGMs, the learned joint distribution is not approximated, and the uniform samples are drawn from the true distribution. Our experimentation using queries and datasets from TPC-H, JOB, TPC-DS, and Twitter shows PGMJoins to outperform the state of the art (by 2X-28X).
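
As a hedged illustration of why factoring the join distribution enables uniform sampling without materialization, here is a two-table sketch of our own (not PGMJoins, which generalizes this to n-way, many-to-many, and cyclic joins via PGM inference): weighting each left-hand row by its number of join partners makes every joined pair equally likely.

```python
import random
from collections import defaultdict

def sample_join(R, S, key_r, key_s, k):
    """Draw k uniform samples from R JOIN S without materializing it,
    by factoring the join as p(r) * p(s | join key of r)."""
    partners = defaultdict(list)
    for s in S:
        partners[key_s(s)].append(s)
    # weight each r by its partner count so every (r, s) pair in the
    # join result has probability 1 / |join result|
    weights = [len(partners[key_r(r)]) for r in R]
    out = []
    for _ in range(k):
        r = random.choices(R, weights=weights)[0]
        out.append((r, random.choice(partners[key_r(r)])))
    return out

# usage: sample_join(R, S, key_r=lambda r: r[0], key_s=lambda s: s[0], k=10)
```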
Approximating multidimensional range counts with maximum error guarantees
We address the problem of compactly approximating multidimensional range counts with a guaranteed maximum error and propose a novel histogram-based summary structure, termed SliceHist. The key idea is to operate a grid histogram in an approximately rank-transformed space, where the data points are more uniformly distributed and each grid slice contains only a small number of points. The points of each slice are then summarised again using the same technique. As each query box partially intersects only a few slices and each grid slice has few data points, the summary is able to achieve tight error guarantees. In experiments and through analysis of non-asymptotic formulas, we show that SliceHist is not only competitive with existing heuristics in terms of performance, but additionally offers tight error guarantees. Presented at the 2021 IEEE 37th International Conference on Data Engineering (ICDE).
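
As we read the abstract, the first step admits a short sketch (an assumption on our part, not the paper's exact construction): approximate the rank transform with per-dimension empirical quantiles, then count points in the resulting grid; SliceHist then recursively summarises each slice the same way.

```python
import numpy as np

def slice_grid(points, g):
    """Approximate rank transform via per-dimension quantiles: the
    quantile edges make a regular grid whose slices hold roughly equal
    point counts. Assumes coordinates are distinct enough that the
    quantile edges are strictly increasing."""
    pts = np.asarray(points, dtype=float)
    edges = [np.quantile(pts[:, j], np.linspace(0, 1, g + 1))
             for j in range(pts.shape[1])]
    counts, _ = np.histogramdd(pts, bins=edges)
    return edges, counts
```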