The State of the Art in Multilayer Network Visualization
Modelling relationships between entities in real-world systems with a simple
graph is a standard approach. However, reality is better embraced as several
interdependent subsystems (or layers). Recently the concept of a multilayer
network model has emerged from the field of complex systems. This model can be
applied to a wide range of real-world datasets. Examples of multilayer networks
can be found in the domains of life sciences, sociology, digital humanities and
more. Within the domain of graph visualization there are many systems which
visualize datasets having many characteristics of multilayer graphs. This
report provides a state of the art and a structured analysis of contemporary
multilayer network visualization, not only for researchers in visualization,
but also for those who aim to visualize multilayer networks in the domain of
complex systems, as well as those developing systems across application
domains. We have explored the visualization literature to survey visualization
techniques suitable for multilayer graph visualization, as well as tools,
tasks, and analytic techniques from within application domains. This report
also identifies the outstanding challenges for multilayer graph visualization
and suggests future research directions for addressing them.
Equivalence of Queries with Nested Aggregation
Query equivalence is a fundamental problem within database theory. The correctness of all forms of logical query rewriting (join minimization, view flattening, rewriting over materialized views, various semantic optimizations that exploit schema dependencies, federated query processing and other forms of data integration) requires proving that the final executed query is equivalent to the original user query. Hence, advances in the theory of query equivalence enable advances in query processing and optimization.
In this thesis we address the problem of deciding query equivalence between conjunctive SQL queries containing aggregation operators that may be nested. Our focus is on understanding the interaction between nested aggregation operators and the other parts of the query body, and so we model aggregation functions simply as abstract collection constructors. Hence, the precise language that we study is a conjunctive algebraic language that constructs complex objects from databases of flat relations. Using an encoding of complex objects as flat relations, we reduce the query equivalence problem for this algebraic language to deciding equivalence between relational encodings output by traditional conjunctive queries (not containing aggregation). This encoding-equivalence cleanly unifies and generalizes previous results for deciding equivalence of conjunctive queries evaluated under various processing semantics. As part of our study of aggregation operators that can construct empty sub-collections (so-called "scalar" aggregation), we consider query equivalence for conjunctive queries extended with a left outer join operator, a very practical class of queries for which the general equivalence problem has never before been analyzed. Although we do not completely solve the equivalence problem for queries with outer joins or with scalar aggregation, we do propose useful sufficient conditions that generalize previously known results for restricted classes of queries. Overall, this thesis offers new insight into the fundamental principles governing the behaviour of nested aggregation.
Scalability considerations for multivariate graph visualization
Real-world, multivariate datasets are frequently too large to show in their entirety on a visual display. Still, there are many techniques we can employ to show useful partial views, sufficient to support incremental exploration of large graph datasets. In this chapter, we first explore the cognitive and architectural limitations which restrict the amount of visual bandwidth available to multivariate graph visualization approaches. These limitations afford several design approaches, which we systematically explore. Finally, we survey systems and studies that exhibit these design strategies to mitigate these perceptual and architectural limitations.
Recording and Replaying User Intentions in Coordinated Multiple View Visualizations
Visualization tools help people gain insights into data. Analysts often want to revisit and review previously visited visualization states to make sense of their previous observations. Browsing a history of visualization interactions is often useful. Traditionally, history mechanisms are based on either undo-redo or replay of low-level keyboard and mouse interactions. In this thesis, we examine the feasibility of translating low-level interactions into high-level user intentions for the purpose of recording user actions at a semantic level. Yi et al. taxonomize low-level interactions into seven higher-level user intents: Select, Connect, Encode, Filter, Explore, Reconfigure, and Abstract/Elaborate. Our hypothesis is that a rule-based system can translate low-level mechanical interactions into user intentions under the Yi taxonomy.
Many visualizations are designed around the data state model, in which visualizations are composed of parameters, operators, datasets, and views. Dependencies between these objects define a coordination query graph. When a user interacts with a visualization, these objects get modified in particular sequences to process, render, and display the information. Our core idea is to define rules that map the activity in sets of connected objects in a visualization coordination graph into corresponding user intentions. By dissecting existing visualization designs, we identified and characterized distinct mapping functions for each type of intention. We then collected these functions in a set of rules for deducing user intentions.
Based on the identified mapping functions, we implemented a rule system as a new capability in the Improvise visualization environment, for discerning the user intentions behind user interactions. User intentions detected by the rule system are recorded in an automatically generated data set to allow a user to revisit earlier visualization states. We designed a user interface to let the user query the intent data set and restore and replay past visualization states. Finally, we assessed the utility of the system for performing queries and replaying visualization history at the level of intentions.
Efficient Source Selection For SPARQL Endpoint Query Federation
The Web of Data has grown enormously in recent years. Currently, it comprises a large compendium of linked and distributed datasets from multiple domains. Due to the decentralised architecture of the Web of Data, several of these datasets contain complementary data. Running complex queries on this compendium thus often requires accessing data from different data sources within one query. The abundance of datasets and the need to run complex queries have thus motivated a considerable body of work on SPARQL query federation systems, the dedicated means to access data distributed over the Web of Data.
This thesis addresses two key areas of federated SPARQL query processing: (1) efficient source selection, and (2) comprehensive SPARQL benchmarks to test and rank federated SPARQL engines as well as triple stores.
Efficient Source Selection: Efficient source selection is one of the most important optimization steps in federated SPARQL query processing. Overestimating the query-relevant data sources increases network traffic, produces irrelevant intermediate results, and can significantly affect the overall query processing time. Previous works have focused on generating optimized query execution plans for fast result retrieval. However, devising source selection approaches beyond triple pattern-wise source selection has not received much attention. Similarly, little attention has been paid to the effect of duplicated data on federated querying. This thesis presents HiBISCuS and TBSS, novel hypergraph-based source selection approaches, and DAW, a duplicate-aware source selection approach to federated querying over the Web of Data. Each of these approaches can be combined directly with existing SPARQL query federation engines to achieve the same recall while querying fewer data sources. We combined the three source selection approaches (HiBISCuS, DAW, and TBSS) with query rewriting to form a complete SPARQL query federation engine named Quetsal. Furthermore, we present TopFed, a federated query processing engine tailored to The Cancer Genome Atlas (TCGA) that exploits the data distribution to perform intelligent source selection while querying over large TCGA SPARQL endpoints. Finally, we address the issue of rights management and privacy while accessing sensitive resources. To this end, we present SAFE, a global source selection approach that enables decentralised, policy-aware access to sensitive clinical information represented as distributed RDF Data Cubes.
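For orientation, the baseline that the paragraph above refers to, triple pattern-wise source selection, can be sketched in a few lines. This is only an illustration under stated assumptions and not the implementation of HiBISCuS, TBSS, or DAW: endpoints are modelled as in-memory sets of triples rather than real SPARQL endpoints, and the names `matches`, `select_sources`, `endpoints`, and `query` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def matches(pattern: Triple, triple: Triple) -> bool:
    """A pattern term starting with '?' is a variable and matches any value."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))


def select_sources(
    patterns: Iterable[Triple], endpoints: Dict[str, Set[Triple]]
) -> Dict[Triple, Set[str]]:
    """For each triple pattern, keep every endpoint holding at least one match.

    Each pattern is judged in isolation, which is what makes the selection
    "triple pattern-wise": joins between patterns are ignored.
    """
    return {
        pattern: {
            name
            for name, triples in endpoints.items()
            if any(matches(pattern, t) for t in triples)
        }
        for pattern in patterns
    }


# Hypothetical toy data standing in for three SPARQL endpoints.
endpoints = {
    "drugs": {("aspirin", "type", "Drug"), ("aspirin", "treats", "headache")},
    "genes": {("BRCA1", "type", "Gene")},
    "mirror": {("aspirin", "type", "Drug")},
}
query = [("?d", "type", "Drug"), ("?d", "treats", "?x")]
selection = select_sources(query, endpoints)
```

In this toy setup the `mirror` endpoint is selected for the first pattern although it only repeats a triple already held by `drugs`, which gives a small picture of the overestimation and duplication issues the paragraph describes.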
Comprehensive SPARQL Benchmarks: Benchmarking is indispensable when aiming to assess technologies with respect to their suitability for given tasks. While several benchmarks and benchmark generation frameworks have been developed to evaluate federated SPARQL engines and triple stores, they mostly provide a one-size-fits-all solution to the benchmarking problem. This approach to benchmarking is, however, unsuitable for evaluating the performance of a triple store for a given application with particular requirements. The fitness of current SPARQL query federation approaches for real applications is difficult to evaluate with current benchmarks, as these are either synthetic or too small in size and complexity. Furthermore, state-of-the-art federated SPARQL benchmarks mostly focus on a single performance criterion, i.e., the overall query runtime. Thus, they cannot provide a fine-grained evaluation of the systems. We address these drawbacks by presenting FEASIBLE, an automatic approach for the generation of benchmarks out of the query history of applications, i.e., query logs, and LargeRDFBench, a billion-triple benchmark for SPARQL query federation which encompasses real data as well as real queries pertaining to real bio-medical use cases.
Our evaluation results show that HiBISCuS, TBSS, TopFed, DAW, and SAFE can all significantly reduce the total number of sources selected and thus improve overall query performance. In particular, TBSS is the first source selection approach to remain under 5% overall overestimation of relevant sources. Compared to state-of-the-art federation engines, Quetsal reduces the number of sources selected (without losing recall), the source selection time, and the overall query runtime. The LargeRDFBench evaluation results suggest that the performance of current SPARQL query federation systems on simple queries does not reflect the systems' performance on more complex queries. Moreover, current federation systems seem unable to deal with many of the challenges that await them in the age of Big Data. Finally, FEASIBLE's evaluation results show that it generates better sample queries than the state of the art. In addition, the better query selection and the larger set of query types used lead to triple store rankings which partly differ from the rankings generated by previous works.
Decision making under uncertainty
Almost all important decision problems are inevitably subject to some level of uncertainty either about data measurements, the parameters, or predictions describing future evolution. The significance of handling uncertainty is further amplified by the large volume of uncertain data automatically generated by modern data gathering or integration systems. Various types of problems of decision making under uncertainty have been subject to extensive research in computer science, economics and social science. In this dissertation, I study three major problems in this context, ranking, utility maximization, and matching, all involving uncertain datasets.
First, we consider the problem of ranking and top-k query processing over probabilistic datasets. By illustrating the diverse and conflicting behaviors of the prior proposals, we contend that a single, specific ranking function may not suffice for probabilistic datasets. Instead we propose the notion of parameterized ranking functions, that generalize or can approximate many of the previously proposed ranking functions. We present novel exact or approximate algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations or the probability distributions are continuous.
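To make the idea of a parameterized ranking function concrete, here is a minimal sketch. It is not the dissertation's algorithm: it assumes a tuple-independent dataset with distinct scores, and scores each tuple as the sum over ranks j of weight(j) times the probability that the tuple exists and is ranked j. The function names and the toy relation are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (name, score, probability of existence); tuples are assumed independent
# and scores are assumed distinct.
ProbTuple = Tuple[str, float, float]


def rank_distribution(tuples: List[ProbTuple]) -> Dict[str, List[float]]:
    """Return, per tuple, a list whose entry j-1 is Pr(tuple exists and has rank j)."""
    ordered = sorted(tuples, key=lambda t: -t[1])
    # higher[c] = probability that exactly c higher-scored tuples exist
    higher = [1.0]
    dist: Dict[str, List[float]] = {}
    for name, _score, p in ordered:
        dist[name] = [p * q for q in higher]
        # Fold this tuple into the count distribution for the tuples below it.
        nxt = [0.0] * (len(higher) + 1)
        for c, q in enumerate(higher):
            nxt[c] += q * (1.0 - p)
            nxt[c + 1] += q * p
        higher = nxt
    return dist


def prf_scores(
    tuples: List[ProbTuple], weight: Callable[[int], float]
) -> Dict[str, float]:
    """Score each tuple as the weighted sum of its rank probabilities."""
    dist = rank_distribution(tuples)
    return {
        name: sum(weight(j + 1) * pr for j, pr in enumerate(probs))
        for name, probs in dist.items()
    }


# Hypothetical toy relation.
relation = [("a", 10.0, 0.5), ("b", 8.0, 1.0), ("c", 5.0, 0.8)]
# A step weight turns the score into the probability of appearing in the top 2.
top2 = prf_scores(relation, lambda rank: 1.0 if rank <= 2 else 0.0)
ranking = sorted(top2, key=top2.get, reverse=True)
```

Swapping in a different `weight` function changes the ranking semantics without touching the rank-probability computation, which is the appeal of parameterizing the ranking function.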
The second problem concerns the stochastic versions of a broad class of combinatorial optimization problems. We observe that the expected value is inadequate in capturing different types of risk-averse or risk-prone behaviors, and instead we consider a more general objective which is to maximize the expected utility of the solution for some given utility function. We present a polynomial time approximation algorithm with additive error ε for any ε > 0, under certain conditions. Our result generalizes and improves several prior results on stochastic shortest path, stochastic spanning tree, and stochastic knapsack.
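The gap between expected value and expected utility can be seen in a very small example. The sketch below is an illustration only, not the approximation algorithm mentioned above; the two routes, their travel-time distributions, and the deadline utility `on_time` are invented for the purpose.

```python
from typing import Callable, Dict, List, Tuple

# A discrete distribution over travel times: (minutes, probability) pairs.
Distribution = List[Tuple[float, float]]


def expected_cost(dist: Distribution) -> float:
    """Plain expected travel time."""
    return sum(p * x for x, p in dist)


def expected_utility(dist: Distribution, utility: Callable[[float], float]) -> float:
    """Expected value of the utility applied to the travel time."""
    return sum(p * utility(x) for x, p in dist)


# Hypothetical routes: one fast on average but risky, one slower but certain.
routes: Dict[str, Distribution] = {
    "risky": [(10.0, 0.5), (30.0, 0.5)],
    "steady": [(22.0, 1.0)],
}


def on_time(minutes: float) -> float:
    """Risk-averse utility: all that matters is arriving within 25 minutes."""
    return 1.0 if minutes <= 25.0 else 0.0


best_by_mean = min(routes, key=lambda r: expected_cost(routes[r]))
best_by_utility = max(routes, key=lambda r: expected_utility(routes[r], on_time))
```

Minimizing expected travel time prefers the risky route (mean 20 against 22), while the deadline utility prefers the steady one (on-time probability 1.0 against 0.5), so the two objectives disagree on the same data.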
The third is the stochastic matching problem, which finds interesting applications in online dating, kidney exchange and online ad assignment. In this problem, the existence of each edge is uncertain and can only be determined by probing the edge. The goal is to design a probing strategy to maximize the expected weight of the matching. We give linear programming based constant-factor approximation algorithms for weighted stochastic matching, which answer an open question raised in prior work.
A new Nested Graph Model for Data Integration
Although graph data has gained increasing interest in several fields, no data model has yet been conceived that is suitable for both querying and integrating differently structured graph and (semi)structured data. Possible reasons are the lack of operators for combining (multiple) graphs in current graph query languages (graph joins), and graph data structures that allow neither data integration nor nested multidimensional representations (graph nesting). To make such data integration possible, this thesis proposes a novel model (General Semistructured data Model) that represents both graphs and arbitrarily nested contents (e.g., one node can be contained by more than one parent node), thus allowing the definition of a nested graph model in which both vertices and edges may include (overlapping) graphs.
We provide two graph join algorithms (Graph Conjunctive Equijoin Algorithm and Graph Conjunctive Less-equal Algorithm) and one graph nesting algorithm (Two HOp Separated Patterns). Their evaluation on top of our secondary memory representation showed the inefficiency of existing query languages' query plans on top of their respective data models (relational, graph and document-oriented). In all three algorithms, the improvement was made possible by using an adjacency list graph representation, which reduces the cost of joining vertices with their respective outgoing (or ingoing) edges, and by associating hash values with both vertices and edges.
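As a rough illustration of why adjacency lists and hashing help, the following sketch joins two small in-memory graphs. It is not the thesis's Graph Conjunctive Equijoin Algorithm or its secondary memory layout. It assumes that vertices are paired when they agree on a join attribute, and that "conjunctive" means a result edge exists only when both operands have an edge between the corresponding vertices. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple

Vertices = Dict[Hashable, dict]          # vertex id -> attribute dict
Adjacency = Dict[Hashable, Set[Hashable]]  # vertex id -> successor ids
Graph = Tuple[Vertices, Adjacency]


def conjunctive_equijoin(g1: Graph, g2: Graph, key: str):
    """Join two graphs on equality of attribute `key` (assumed semantics)."""
    verts1, adj1 = g1
    verts2, adj2 = g2

    # Hash the second graph's vertices by join key so each lookup is O(1).
    buckets = defaultdict(list)
    for v, attrs in verts2.items():
        buckets[attrs[key]].append(v)

    # Vertex join: pair up vertices that agree on the key.
    joined = {}
    for u, attrs in verts1.items():
        for v in buckets.get(attrs[key], []):
            joined[(u, v)] = {**attrs, **verts2[v]}

    # Edge join: adjacency lists give each vertex's outgoing edges directly,
    # so no scan over a global edge table is needed.
    edges = set()
    for (u, v) in joined:
        for u_next in adj1.get(u, ()):
            for v_next in adj2.get(v, ()):
                if (u_next, v_next) in joined:
                    edges.add(((u, v), (u_next, v_next)))
    return joined, edges


# Hypothetical toy graphs sharing a "name" attribute.
g1 = ({1: {"name": "ann"}, 2: {"name": "bob"}, 3: {"name": "cy"}},
      {1: {2}, 2: {3}})
g2 = ({"a": {"name": "ann"}, "b": {"name": "bob"}, "c": {"name": "cy"}},
      {"a": {"b"}, "c": {"b"}})
joined_vertices, joined_edges = conjunctive_equijoin(g1, g2, "name")
```

Only the edge from `(1, "a")` to `(2, "b")` survives here, because it is the only pair of joined vertices connected in both operands under the assumed semantics.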
As a secondary outcome of this thesis, a general data integration scenario is provided where both graph data and other semistructured and structured data can be represented and integrated into the General Semistructured data Model. A new query language over this data model, the General Semistructured Query Language, outlines the feasibility of this approach and can express both graph joins and graph nestings. This language is also capable of representing both traversal and data manipulation operators.