Cleaning data with Llunatic
Data cleaning (or data repairing) is considered a crucial problem in many database-related tasks. It consists of making a database consistent with respect to a given set of constraints. In recent years, repairing methods have been proposed for several classes of constraints. These methods, however, tend to hard-code the strategy used to repair conflicting values and are specialized toward specific classes of constraints. In this paper, we develop a general chase-based repairing framework, referred to as Llunatic, in which repairs can be obtained for a large class of constraints and by using different strategies to select preferred values. The framework is based on an elegant formalization in terms of labeled instances and partially ordered preference labels. In this context, we revisit concepts such as upgrades, repairs and the chase. In Llunatic, various repairing strategies can be slotted in without changing the underlying implementation. Furthermore, Llunatic is the first data repairing system that is DBMS-based. We report experimental results that confirm its good scalability and show that various instantiations of the framework produce repairs of good quality.
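The repair loop that Llunatic generalizes can be illustrated with a minimal sketch (this is not Llunatic's actual API; the function and data are hypothetical): detect tuples that jointly violate a functional dependency and resolve the conflict with a pluggable preference strategy.

```python
# Minimal illustration (not Llunatic's API): repair violations of a
# functional dependency City -> ZipCode by keeping a preferred value.
from collections import defaultdict

def repair_fd(rows, lhs, rhs, prefer=max):
    """Group rows by the FD's left-hand side; if the right-hand side
    conflicts within a group, overwrite it with a preferred value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row)
    for group in groups.values():
        values = {row[rhs] for row in group}
        if len(values) > 1:          # conflict: the FD is violated
            chosen = prefer(values)  # preference strategy is pluggable
            for row in group:
                row[rhs] = chosen
    return rows

data = [
    {"City": "Rome", "ZipCode": "00100"},
    {"City": "Rome", "ZipCode": "00184"},  # conflicts with the row above
    {"City": "Oslo", "ZipCode": "0150"},
]
repaired = repair_fd(data, "City", "ZipCode")
```

The point mirrors the abstract: the conflict-resolution strategy (`prefer`) is a parameter, so different repair strategies can be slotted in without changing the repair loop itself.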
Tackling the veracity and variety of big data
This thesis tackles the veracity and variety challenges of big data, especially focusing
on graphs and relational data. We start with proposing a class of graph association
rules (GARs) to specify regularities between entities in graphs, which capture both
missing links and inconsistencies. A GAR is a combination of a graph pattern and a
dependency; it may take as predicates machine learning classifiers for link prediction.
We formalize association deduction with GARs in terms of the chase, and prove its
Church-Rosser property. We show that the satisfiability, implication and association
deduction problems for GARs are coNP-complete, NP-complete and NP-complete, respectively.
The incremental deduction problem is DP-complete for GARs. In addition,
we provide parallel algorithms for association deduction and incremental deduction.
We next develop a parallel algorithm to discover GARs, which applies an application-driven
strategy to cut back rules and data that are irrelevant to users' interest, by training
a machine learning model to identify data pertaining to a given application. Moreover,
we introduce a sampling method to reduce a big graph G to a set H of small
sample graphs. Given expected support and recall bounds, this method is able to deduce
samples in H and mine rules from H to satisfy the bounds in the entire G.
Then we propose a class of temporal association rules (TACOs) for event prediction
in temporal graphs. TACOs are defined on temporal graphs in terms of change patterns
and (temporal) conditions, and may carry machine learning predicates for temporal
event prediction. We settle the complexity of reasoning about TACOs, including their
satisfiability, implication and prediction problems. We develop a system that discovers
TACOs by iteratively training a rule creator based on generative models in a creator-critic
framework, and predicts events by applying the discovered TACOs in parallel.
Finally, we propose an approach to querying relations D and graphs G taken together
in SQL. The key idea is that if a tuple t in D and a vertex v in G are determined
to refer to the same real-world entity, then we join t and v, correlate their information
and complement tuple t with additional attributes of v from graphs. We show how to
do this in SQL extended with only syntactic sugar, for both static joins when t is a tuple
in D and dynamic joins when t comes from intermediate results of sub-queries on D.
To support the semantic joins, we propose an attribute extraction scheme based on K-means
clustering, to identify and fetch graph properties that are linked to v via paths.
Moreover, we develop a scheme to extract a relation schema for entities in graphs, and
a heuristic join method based on the schema to strike a balance between the complexity
and accuracy of dynamic joins.
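The core idea of the semantic join above can be sketched as follows (an illustrative sketch only; the names and the matching predicate are hypothetical, not the thesis's SQL extension): when a tuple and a vertex are matched to the same entity, the tuple is complemented with the vertex's additional properties.

```python
# Hypothetical sketch of a "semantic join": if a tuple t and a graph
# vertex v refer to the same real-world entity, join them and complement
# t with v's extra attributes. All names here are illustrative.
def semantic_join(tuples, vertices, same_entity):
    joined = []
    for t in tuples:
        for v in vertices:
            if same_entity(t, v):
                # take only properties the tuple does not already have
                extra = {k: val for k, val in v.items() if k not in t}
                joined.append({**t, **extra})
    return joined

relation = [{"name": "Ada Lovelace", "year": 1815}]
graph_vertices = [{"name": "Ada Lovelace", "occupation": "mathematician"}]
result = semantic_join(relation, graph_vertices,
                       lambda t, v: t["name"] == v["name"])
```

In the thesis this happens inside SQL with only syntactic sugar; the sketch just makes the "correlate and complement" step concrete.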
Querying big data with bounded data access
Query answering over big data is cost-prohibitive. A linear scan of a dataset D may
take days with a solid state device if D is of PB size and years if D is of EB size. In
other words, polynomial-time (PTIME) algorithms for query evaluation are already
not feasible on big data. To tackle this, we propose querying big data with bounded
data access, such that the cost of query evaluation is independent of the scale of D.
First of all, we propose a class of boundedly evaluable queries. A query Q is boundedly
evaluable under a set A of access constraints if for any dataset D that satisfies
constraints in A, there exists a subset DQ ⊆ D such that (a) Q(DQ) = Q(D), and (b) the
time for identifying DQ from D, and hence the size |DQ| of DQ, are independent of |D|.
That is, we can compute Q(D) by accessing a bounded amount of data no matter how
big D grows. We study the problem of deciding whether a query is boundedly evaluable
under A. It is known that the problem is undecidable for FO without access constraints.
We show that, in the presence of access constraints, it is decidable in 2EXPSPACE for
positive fragments of FO queries, but is already EXPSPACE-hard even for CQ.
To handle the undecidability and high complexity of the analysis, we develop effective
syntax for boundedly evaluable queries under A, referred to as queries covered
by A, such that, (a) any boundedly evaluable query under A is equivalent to a query
covered by A, (b) each covered query is boundedly evaluable, and (c) it is efficient to
decide whether Q is covered by A. On top of DBMS, we develop practical algorithms
for checking whether queries are covered by A, and generating bounded plans if so.
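How an access constraint yields a bounded plan can be illustrated with a small sketch (illustrative only; the class and constraint below are hypothetical, not the thesis's formalism): if the constraint guarantees that each value of an indexed attribute matches at most N tuples, then a selection on that attribute touches at most N tuples no matter how large D grows.

```python
# Illustrative only: an access constraint of the form
# "each country value indexes at most `bound` tuples, fetchable by index"
# makes selection on country boundedly evaluable.
from collections import defaultdict

class IndexedDataset:
    def __init__(self, rows, key, bound):
        self.bound = bound                  # the N from the access constraint
        self.index = defaultdict(list)
        for row in rows:
            self.index[row[key]].append(row)
        # the dataset must satisfy the constraint for the bound to hold
        assert all(len(v) <= bound for v in self.index.values())

    def fetch(self, value):
        """Bounded access: returns at most `bound` tuples via the index,
        so the cost is independent of the total dataset size |D|."""
        return self.index.get(value, [])

rows = [{"country": "NO", "city": "Oslo"},
        {"country": "NO", "city": "Bergen"},
        {"country": "IT", "city": "Rome"}]
ds = IndexedDataset(rows, key="country", bound=2)
answer = ds.fetch("NO")   # touches at most 2 tuples, however big D is
```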
For queries that are not boundedly evaluable, we extend bounded evaluability
to resource-bounded approximation and bounded query rewriting using views.
(1) Resource-bounded approximation is parameterized with a resource ratio a ∈ (0,1],
such that for any query Q and dataset D, it computes approximate answers with an
accuracy bound h by accessing at most a|D| tuples. It is based on extended access constraints
and a new accuracy measure. (2) Bounded query rewriting tackles the problem
by incorporating bounded evaluability with views, such that the queries can be exactly
answered by accessing cached views and a bounded amount of data in D. We study the
problem of deciding whether a query has a bounded rewriting, establish its complexity
bounds, and develop effective syntax for FO queries with a bounded rewriting.
Finally, we extend bounded evaluability to graph pattern queries, by extending
access constraints to graph data. We characterize bounded evaluability for subgraph
and simulation patterns and develop practical algorithms for the associated problems.
Towards effective analysis of big graphs: from scalability to quality
This thesis investigates the central issues underlying graph analysis, namely, scalability
and quality.
We first study the incremental problems for graph queries, which aim to compute
the changes to the old query answer, in response to the updates to the input graph.
The incremental problem is called bounded if its cost is decided by the sizes of the
query and the changes only. Desirable as this property is, our first results are
negative: for common graph queries such as graph traversal, connectivity, keyword
search and pattern matching, their incremental problems are unbounded. In light of
the negative results, we propose two new characterizations for the effectiveness of
incremental computation, and show that the incremental computations above can still
be effectively conducted, by either reducing the computations on big graphs to small
data, or incrementalizing batch algorithms by minimizing unnecessary recomputation.
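For one concrete case of incrementalizing rather than recomputing, consider connectivity under edge insertions (a deliberately easy special case, sketched here for illustration; it is deletions that make the general incremental problem hard): a union-find structure processes each change in near-constant amortized time, independent of the graph's size.

```python
# Sketch: incremental connectivity under edge insertions via union-find.
# Each update is processed in near-constant amortized time, so the cost
# depends on the changes, not on the size of the whole graph.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # path halving keeps the trees shallow
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def insert_edge(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv

    def connected(self, u, v):
        return self.find(u) == self.find(v)

g = UnionFind()
g.insert_edge("a", "b")
assert not g.connected("a", "c")
g.insert_edge("b", "c")   # only the change is processed, not the graph
assert g.connected("a", "c")
```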
We next study the problems with regards to improving the quality of the graphs.
To uniquely identify entities represented by vertices in a graph, we propose a class of
keys that are recursively defined in terms of graph patterns, and are interpreted with
subgraph isomorphism. As an application, we study the entity matching problem,
which is to find all pairs of entities in a graph that are identified by a given set of
keys. Although the problem is proved to be intractable, and cannot be parallelized in
logarithmic rounds, we provide two parallel scalable algorithms for it.
In addition, to catch numeric inconsistencies in real-life graphs, we extend graph
functional dependencies with linear arithmetic expressions and comparison predicates,
referred to as NGDs. Indeed, NGDs strike a balance between expressivity and complexity,
since if we allow non-linear arithmetic expressions, even of degree at most 2, the
satisfiability and implication problems become undecidable. A localizable incremental
algorithm is developed to detect errors using NGDs, where the cost is determined by
small neighbors of nodes in the updates instead of the entire graph.
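A drastically simplified NGD-style check might look as follows (the rule, labels, and attributes are invented for illustration; real NGDs combine graph patterns with linear arithmetic and comparison predicates): scan matches of a pattern and flag those whose attribute values violate a linear inequality.

```python
# Illustrative NGD-style check (simplified): on any edge labelled
# "subtask", the child's duration must not exceed the parent's,
# i.e. the linear predicate parent.duration >= child.duration.
def violations(nodes, edges):
    """Return the edges whose endpoints violate the predicate."""
    bad = []
    for parent, child, label in edges:
        if (label == "subtask"
                and nodes[child]["duration"] > nodes[parent]["duration"]):
            bad.append((parent, child))
    return bad

nodes = {"t1": {"duration": 10},
         "t2": {"duration": 4},
         "t3": {"duration": 12}}
edges = [("t1", "t2", "subtask"), ("t1", "t3", "subtask")]
errs = violations(nodes, edges)   # t3 takes longer than its parent t1
```

Localizing such a check to the small neighborhoods of updated nodes, as the abstract describes, is what keeps incremental detection cheap.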
Finally, a rule-based method to clean graphs is proposed. We extend graph entity
dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of
ground truth, we fix violations of GEDs in the graph by combining data repairing and
object identification. The method finds certain fixes to errors detected by GEDs, i.e.,
as long as the GEDs and the ground truth are correct, the fixes are assured correct as
their logical consequences. Several fundamental results underlying the method are established,
and an algorithm is developed to implement the method. We also parallelize
the method, with a guarantee that its running time decreases as more processors are used.
Verification of Graph Programs
This thesis is concerned with verifying the correctness of programs written in GP 2 (for Graph Programs), an experimental, nondeterministic graph manipulation language, in which program states are graphs, and computational steps are applications of graph transformation rules. GP 2 allows for visual programming at a high level of abstraction, with the programmer freed from manipulating low-level data structures and instead solving graph-based problems in a direct, declarative, and rule-based way. To verify that a graph program meets some specification, however, has been -- prior to the work described in this thesis -- an ad hoc task, detracting from the appeal of using GP 2 to reason about graph algorithms, high-level system specifications, pointer structures, and the many other practical problems in software engineering and programming languages that can be modelled as graph problems. This thesis describes some contributions towards the challenge of verifying graph programs, in particular, Hoare logics with which correctness specifications can be proven in a syntax-directed and compositional manner.
We contribute calculi of proof rules for GP 2 that allow for rigorous reasoning about both partial correctness and termination of graph programs. These are given in an extensional style, i.e. independent of fixed assertion languages. This approach allows for the re-use of proof rules with different assertion languages for graphs, and moreover, allows for properties of the calculi to be inherited: soundness, completeness for termination, and relative completeness (for sufficiently expressive assertion languages).
We propose E-conditions as a graphical, intuitive assertion language for expressing properties of graphs -- both about their structure and labelling -- generalising the nested conditions of Habel, Pennemann, and Rensink. We instantiate our calculi with this language, explore the relationship between the decidability of the model checking problem and the existence of effective constructions for the extensional assertions, and fix a subclass of graph programs for which we have both. The calculi are then demonstrated by verifying a number of data- and structure-manipulating programs.
We explore the relationship between E-conditions and classical logic, defining translations between the former and a many-sorted predicate logic over graphs; the logic being a potential front end to an implementation of our work in a proof assistant.
Finally, we speculate on several avenues of interesting future work; in particular, a possible extension of E-conditions with transitive closure, for proving specifications involving properties about arbitrary-length paths.
Synergi: A Mixed-Initiative System for Scholarly Synthesis and Sensemaking
Efficiently reviewing scholarly literature and synthesizing prior art are
crucial for scientific progress. Yet, the growing scale of publications and the
burden of knowledge make synthesis of research threads more challenging than
ever. While significant research has been devoted to helping scholars interact
with individual papers, building research threads scattered across multiple
papers remains a challenge. Most top-down synthesis approaches (and LLMs) make it
difficult to personalize and iterate on the output, while bottom-up synthesis
is costly in time and effort. Here, we explore a new design space of
mixed-initiative workflows. In doing so we develop a novel computational
pipeline, Synergi, that ties together user input of relevant seed threads with
citation graphs and LLMs, to expand and structure them, respectively. Synergi
allows scholars to start with an entire threads-and-subthreads structure
generated from papers relevant to their interests, and to iterate and customize
on it as they wish. In our evaluation, we find that Synergi helps scholars
efficiently make sense of relevant threads, broaden their perspectives, and
increase their curiosity. We discuss future design implications for
thread-based, mixed-initiative scholarly synthesis support tools. Comment: ACM UIST'2
- …