Efficient Computation of Subspace Skyline over Categorical Domains
Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed
the way we search for accommodation, restaurants, etc. The underlying datasets
in such applications have numerous attributes that are mostly Boolean or
Categorical. Discovering the skyline of such datasets over a subset of
attributes would identify entries that stand out while enabling numerous
applications. Only a few algorithms are designed to compute the skyline
over categorical attributes, and these are applicable only when the number of
attributes is small.
In this paper, we place the problem of skyline discovery over categorical
attributes into perspective and design efficient algorithms for two cases. (i)
In the absence of indices, we propose two algorithms, ST-S and ST-P, that
exploit the categorical characteristics of the datasets, organizing tuples in
a tree data structure, supporting efficient dominance tests over the candidate
set. (ii) We then consider the existence of widely used precomputed sorted
lists. After discussing several approaches, and studying their limitations, we
propose TA-SKY, a novel threshold style algorithm that utilizes sorted lists.
Moreover, we further optimize TA-SKY and explore its progressive nature, making
it suitable for applications with strict interactive requirements. In addition
to the extensive theoretical analysis of the proposed algorithms, we conduct a
comprehensive experimental evaluation on a combination of real (including the
entire AirBnB data collection) and synthetic datasets to study the practicality
of the proposed algorithms. The results showcase the superior performance of
our techniques, outperforming applicable approaches by orders of magnitude.
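To make the notion of a subspace skyline concrete, the following is a minimal sketch of the dominance test over a chosen subset of attributes. It is a quadratic baseline written for illustration only, not the ST-S, ST-P, or TA-SKY algorithms from the abstract. The names `dominates`, `subspace_skyline`, and `listings`, as well as the encoding of each categorical value as an integer rank where larger is better, are assumptions made for this example.

```python
from typing import List, Sequence

# Assumption: every attribute is encoded as an integer rank, larger = better.
Row = Sequence[int]


def dominates(a: Row, b: Row, dims: Sequence[int]) -> bool:
    """True if a is at least as good as b on every dim and strictly better on one."""
    return all(a[d] >= b[d] for d in dims) and any(a[d] > b[d] for d in dims)


def subspace_skyline(rows: Sequence[Row], dims: Sequence[int]) -> List[Row]:
    """Quadratic baseline: keep each row that no other row dominates on dims."""
    return [r for r in rows if not any(dominates(o, r, dims) for o in rows)]


# Hypothetical Boolean amenities per listing: (wifi, parking, pool).
listings = [
    (1, 1, 0),
    (1, 0, 1),
    (0, 0, 1),
    (1, 0, 0),
]

# Skyline restricted to wifi and parking, then over all three attributes.
print(subspace_skyline(listings, [0, 1]))
print(subspace_skyline(listings, [0, 1, 2]))
```

Restricting the query to a different set of attribute indices in `dims` can change which entries stand out, which is why the subspace is part of the query rather than fixed in advance.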
Processing Rank-Aware Queries in Schema-Based P2P Systems
In recent years, there has been considerable research with respect to query
processing in data integration and P2P systems. Conventional data
integration systems consist of multiple sources with possibly different
schemas, adhere to a hierarchical structure, and have a central component
(mediator) that manages a global schema. Queries are formulated against
this global schema and the mediator processes them by retrieving relevant
data from the sources transparently to the user. Building on these
systems, Peer Data Management Systems (PDMSs), also known as schema-based
P2P systems, have attracted attention. Peers participating in
a PDMS can act both as mediators and as data sources, are autonomous, and
might leave or join the network at will. For these reasons, peers often
hold incomplete or erroneous data sets and mappings. The possibly huge
amount of data available in such a network often results in large query
result sets that are hard to manage. Consequently, retrieving the
complete result set is in most cases difficult or even impossible. Applying
rank-aware query operators such as top-N and skyline, possibly in
conjunction with approximation techniques, is a remedy to these problems as
these operators select only those result records that are most relevant to
the user. Because in most cases only a small fraction of the
complete result set is actually output to the user, retrieving the complete
set before evaluating such operators is inefficient.
Therefore, the questions we want to answer in this dissertation are how to
compute such queries in PDMSs and how to do that efficiently. We propose
strategies for efficient query processing in PDMSs that exploit the
characteristics of rank-aware queries and optionally apply approximation
techniques. A peer's relevance is determined on two levels: the schema level
and the data level. Depending on its relevance, a peer is either included in
query processing or excluded. Because peers are heterogeneous, queries need to
be rewritten, enabling cooperation between peers that use different schemas.
As existing query rewriting techniques mostly consider conjunctive queries
only, we present an extension that allows for rewriting queries involving
rank-aware query operators. As PDMSs are dynamic systems and peers might
update their local data at any time, this dissertation addresses not only how
routing indexes are used to determine a peer's relevance at the data level
but also how they can be kept up to date. Finally, we provide a system-level
evaluation by presenting SmurfPDMS (SiMUlating enviRonment For Peer Data
Management Systems), a system created in the context of this dissertation
that implements all presented techniques.
Contributions à l'Optimisation de Requêtes Multidimensionnelles
Analyzing data consists of choosing a subset of the dimensions that describe it in order to extract useful information. However, the "interesting" dimensions are rarely known a priori. Analysis thus becomes an exploratory activity in which each pass translates into a query. It therefore becomes essential to propose query optimization solutions that take a global view of the process rather than trying to optimize each query independently of the others. We present our contributions within this exploratory approach, focusing on three types of queries: (i) the computation of borders, (ii) so-called OLAP (On-Line Analytical Processing) queries over data cubes, and (iii) skyline-type preference queries.
Efficient subspace skyline query based on user preference using MapReduce
Subspace skyline, an important variant of skyline, has been widely applied to multiple-criteria decision making and business planning. With the development of the mobile internet, subspace skyline queries in mobile distributed environments have recently attracted considerable attention. However, efficiently obtaining a meaningful subset of skyline points in any subspace remains challenging in the current mobile internet. For a growing number of mobile applications, subspace skyline queries on mobile units are constrained by data volume and wireless bandwidth. To address this issue, we propose a system model that supports subspace skyline queries in mobile distributed environments. We also present an efficient algorithm for processing the Subspace Skyline Query using MapReduce (SSQ), which obtains a meaningful subset of points from the full set of skyline points in any subspace. The SSQ algorithm divides a subspace skyline query into two phases: the preprocess phase and the query phase. The preprocess phase comprises pruning and index construction, which are designed to reduce network delay and response time. The query phase provides two filtering methods, SQM-filtering and Δ-filtering, to filter the skyline points according to user preference and reduce network cost. Extensive experiments on real and synthetic data indicate that our algorithm is highly efficient and that the pruning strategy further improves its efficiency.
Complex Query Operators on Modern Parallel Architectures
Identifying interesting objects from a large data collection is a fundamental problem for multi-criteria decision making applications. In Relational Database Management Systems (RDBMS), the most popular complex query operators used to solve this type of problem are the Top-K selection operator and the Skyline operator. Top-K selection is tasked with retrieving the k highest-ranking tuples from a given relation, as determined by a user-defined aggregation function. Skyline selection retrieves those tuples with attributes offering (Pareto) optimal trade-offs in a given relation. Efficient Top-K query processing entails minimizing tuple evaluations by utilizing elaborate processing schemes combined with sophisticated data structures that enable early termination. Skyline query evaluation involves supporting processing strategies that are geared towards early termination and incomparable tuple pruning. The rapid increase in memory capacity and decreasing costs have been the main drivers behind the development of main-memory database systems. Although migrating query processing in-memory has created many opportunities to improve the associated query latency, attaining such improvements has been very challenging due to the growing gap between processor and main memory speeds. Addressing this limitation has been made easier by the rapid proliferation of multi-core and many-core architectures. However, their utilization in real systems has been hindered by the lack of suitable parallel algorithms that focus on algorithmic efficiency. In this thesis, we study in depth the Top-K and Skyline selection operators in the context of emerging parallel architectures. Our ultimate goal is to provide practical guidelines for developing work-efficient algorithms suitable for parallel main memory processing. We concentrate on multi-core (CPU), many-core (GPU), and processing-in-memory (PIM) architectures, developing solutions optimized for high throughput and low latency. The first part of this thesis focuses on Top-K selection, presenting the specific details of early termination algorithms that we developed specifically for parallel architectures and various types of accelerators (i.e., GPU, PIM). The second part of this thesis concentrates on Skyline selection and the development of a massively parallel load-balanced algorithm for PIM architectures. Our work consolidates performance results across different parallel architectures using synthetic and real data on variable query parameters and distributions for both of the aforementioned problems. The experimental results demonstrate several orders of magnitude better throughput and query latency, thus validating the effectiveness of our proposed solutions for the Top-K and Skyline selection operators.
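The Top-K definition above (the k highest-ranking tuples under a user-defined aggregation function) can be illustrated with a minimal sequential sketch. It scores every tuple, so it performs no early termination and no parallel execution, and it is not one of the algorithms developed in the thesis. The names `top_k`, `weighted_sum`, and `rows`, along with the equal weights, are assumptions chosen for this example.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, ...]


def top_k(rows: Sequence[Row], k: int, score: Callable[[Row], float]) -> List[Row]:
    """Return the k rows with the highest score, best first.

    heapq.nlargest maintains a bounded heap of size k, so memory stays O(k),
    but every row is still scored once (no early termination).
    """
    return heapq.nlargest(k, rows, key=score)


def weighted_sum(row: Row) -> float:
    """Hypothetical aggregation function: equal-weight sum of two attributes."""
    return 0.5 * row[0] + 0.5 * row[1]


rows = [(3, 1), (2, 5), (4, 4), (1, 1)]
print(top_k(rows, 2, weighted_sum))
```

Early-termination schemes of the kind the abstract mentions aim to avoid the full scan that this baseline performs.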
An Energy-Efficient Skyline Query for Massively Multidimensional Sensing Data
Cyber-physical systems (CPS) sense the environment through wireless sensor networks. The sensing data of such systems are massive and multidimensional. As one of the major monitoring methods used in safe production monitoring and disaster early-warning applications, skyline query algorithms are extensively adopted for multiple-objective decision analysis of these sensing data. As networks grow, the amount of sensing data increases sharply, so improving the efficiency of skyline query algorithms and reducing transmission energy consumption become pressing and difficult problems. Therefore, this paper proposes a new energy-efficient skyline query method for massively multidimensional sensing data. First, the method uses a node-cut strategy to dynamically generate filtering tuples with little computational overhead when collecting query results, instead of issuing queries with filters. It can judge the domination relationship among different nodes, remove the detected data sets of dominated nodes that are irrelevant to the query, modify the query path dynamically, and reduce the data comparison and computational overhead. The efficient dynamic filter generated by this strategy incurs little non-skyline data transmission in the network, and the transmission distance is very short. Second, the method employs a tuple-cutting strategy inside each node: it generates local cutting tuples from the sub-tree rooted at the node itself, and these are used to cut the detected data within the nodes of that sub-tree. This further limits the uploading of non-skyline data. Extensive experimental results show that our method can quickly return an overview of the monitored area and reduce the communication overhead. Additionally, it can shorten the response time and improve the efficiency of the query.