266 research outputs found

A Uniform Dependency Language for Improving Data Quality

Author: Fan Wenfei
Geerts Floris
Publication date: 01/01/2011

Edinburgh Research Explorer

Institutional Repository Universiteit Antwerpen

Extending Dependencies with Conditions

Author: Bravo Loreto
Fan Wenfei
Ma Shuai
Publication date: 01/01/2007

Edinburgh Research Explorer

Semandaq: a data quality system based on conditional functional dependencies

Author: Fan Wenfei
Geerts Floris
Jia Xibei
Publication date: 01/01/2008

Edinburgh Research Explorer

Institutional Repository Universiteit Antwerpen

A revival of integrity constraints for data cleaning

Author: Fan Wenfei
Geerts Floris
Jia Xibei
Publication date: 01/01/2008

Integrity constraints, a.k.a. data dependencies, are being widely used for improving the quality of schema. Recently, constraints have enjoyed a revival for improving the quality of data. The tutorial aims to provide an overview of recent advances in constraint-based data cleaning.

Crossref

Edinburgh Research Explorer

Institutional Repository Universiteit Antwerpen

Performance Guarantees for Distributed Reachability Queries

Author: Fan Wenfei
Wang Xin
Wu Yinghui
Publication date: 01/01/2012

In the real world a graph is often fragmented and distributed across different sites. This highlights the need for evaluating queries on distributed graphs. This paper proposes distributed evaluation algorithms for three classes of queries: reachability for determining whether one node can reach another, bounded reachability for deciding whether there exists a path of a bounded length between a pair of nodes, and regular reachability for checking whether there exists a path connecting two nodes such that the node labels on the path form a string in a given regular expression. We develop these algorithms based on partial evaluation, to explore parallel computation. When evaluating a query Q on a distributed graph G, we show that these algorithms possess the following performance guarantees, no matter how G is fragmented and distributed: (1) each site is visited only once; (2) the total network traffic is determined by the size of Q and the fragmentation of G, independent of the size of G; and (3) the response time is decided by the largest fragment of G rather than the entire G. In addition, we show that these algorithms can be readily implemented in the MapReduce framework. Using synthetic and real-life data, we experimentally verify that these algorithms are scalable on large graphs, regardless of how the graphs are distributed. Comment: VLDB201
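The partial-evaluation idea in the abstract above can be illustrated with a minimal Python sketch for plain reachability only. It is an assumption-laden illustration, not the paper's algorithm: the function names (`local_reach`, `partial_eval`, `solve`, `distributed_reach`), the fragment representation (a pair of a node set and an adjacency dict), and the toy graph are all hypothetical. Each site is processed once and returns, for each of its entry nodes, either "target reached" or the set of nodes in other fragments that it can reach; a coordinator then combines these small answers without looking at the graph again.

```python
from collections import deque

def local_reach(adj, start):
    """Nodes reachable from start using only edges stored at one site."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):  # nodes owned by other sites have no local edges
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def partial_eval(nodes, adj, in_nodes, source, target):
    """Runs at one site: one equation per entry node of this fragment.

    An equation is True if the target is reached locally, otherwise the
    set of nodes in other fragments (border nodes) reached from the entry node.
    """
    equations = {}
    for v in (in_nodes | {source}) & nodes:
        reached = local_reach(adj, v)
        equations[v] = True if target in reached else reached - nodes
    return equations

def solve(equations, source):
    """Runs at the coordinator: search over the collected equations only."""
    seen = {source}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        eq = equations.get(v)
        if eq is True:
            return True
        for w in eq or ():
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

def distributed_reach(fragments, source, target):
    # Setup: nodes that are targets of cross-fragment edges (entry nodes).
    in_nodes = set()
    for nodes, adj in fragments:
        for succs in adj.values():
            in_nodes.update(w for w in succs if w not in nodes)
    # Each site is visited once; these calls are independent and could run in parallel.
    equations = {}
    for nodes, adj in fragments:
        equations.update(partial_eval(nodes, adj, in_nodes, source, target))
    return solve(equations, source)

# Hypothetical graph split over three sites: a->b->c->d->a and e->a.
fragments = [
    ({"a", "b"}, {"a": ["b"], "b": ["c"]}),
    ({"c", "d"}, {"c": ["d"], "d": ["a"]}),
    ({"e"}, {"e": ["a"]}),
]
print(distributed_reach(fragments, "a", "d"))
print(distributed_reach(fragments, "a", "e"))
```

In this sketch the data sent to the coordinator depends only on the number of entry and border nodes, not on the size of the fragments, which mirrors the flavour of the traffic guarantee stated in the abstract. Bounded and regular reachability are not covered.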

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Making Queries Tractable on Big Data with Preprocessing

Author: Fan Wenfei
Geerts Floris
Neven Frank
Publication date: 01/01/2013

A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of Π-tractable queries, denoted by ΠT0Q, to characterize classes of queries that can be answered in parallel poly-logarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are Π-tractable and are feasible on big data. (3) We also study a set ΠTQ of query classes that can be effectively converted to Π-tractable queries by re-factorizing its data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for ΠTQ. (5) We also show that ΠT0Q ⊂ P unless P = NC, i.e., the set ΠT0Q of all Π-tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, ΠTQ = P, i.e., all PTIME query classes can be made Π-tractable via proper re-factorizations. This work is a step towards understanding the tractability of queries in the context of big data.

CiteSeerX

Edinburgh Research Explorer

Diversified Top-k Graph Pattern Matching

Author: Fan Wenfei
Wang Xin
Wu Yinghui
Publication date: 01/01/2013

Edinburgh Research Explorer

Reasoning about Record Matching Rules

Author: Fan Wenfei
Jia Xibei
Li Jianzhong
Ma Shuai
Publication date: 01/01/2009

To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n^2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.

Crossref

Edinburgh Research Explorer

Putting Context into Schema Matching

Author: Bohannon Philip
Elnahrawy Eiman
Fan Wenfei
Flaster Michael
Publication date: 01/01/2006

Edinburgh Research Explorer

Constraints for Semistructured Data and XML

Author: Buneman Peter
Fan Wenfei
Siméon Jérôme
Weinstein Scott
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2001

Integrity constraints play a fundamental role in database design. We review initial work on the expression of integrity constraints for semistructured data and XML.

CiteSeerX

Crossref

Edinburgh Research Explorer