49 research outputs found
A Labeled Graph Kernel for Relationship Extraction
In this paper, we propose an approach for Relationship Extraction (RE) based
on labeled graph kernels. The kernel we propose is a particularization of a
random walk kernel that exploits two properties previously studied in the RE
literature: (i) the words between the candidate entities or connecting them in
a syntactic representation are particularly likely to carry information
regarding the relationship; and (ii) combining information from distinct
sources in a kernel may help the RE system make better decisions. We performed
experiments on a dataset of protein-protein interactions and the results show
that our approach obtains effectiveness values that are comparable with the
state-of-the art kernel methods. Moreover, our approach is able to outperform
the state-of-the-art kernels when combined with other kernel methods
On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research
Scientific research requires access, analysis, and sharing of data that is
distributed across various heterogeneous data sources at the scale of the
Internet. An eager ETL process constructs an integrated data repository as its
first step, integrating and loading data in its entirety from the data sources.
The bootstrapping of this process is not efficient for scientific research that
requires access to data from very large and typically numerous distributed data
sources. a lazy ETL process loads only the metadata, but still eagerly. Lazy
ETL is faster in bootstrapping. However, queries on the integrated data
repository of eager ETL perform faster, due to the availability of the entire
data beforehand.
In this paper, we propose a novel ETL approach for scientific data
integration, as a hybrid of eager and lazy ETL approaches, and applied both to
data as well as metadata. This way, Hybrid ETL supports incremental integration
and loading of metadata and data from the data sources. We incorporate a
human-in-the-loop approach, to enhance the hybrid ETL, with selective data
integration driven by the user queries and sharing of integrated data between
users. We implement our hybrid ETL approach in a prototype platform, Obidos,
and evaluate it in the context of data sharing for medical research. Obidos
outperforms both the eager ETL and lazy ETL approaches, for scientific research
data integration and sharing, through its selective loading of data and
metadata, while storing the integrated data in a scalable integrated data
repository.Comment: Pre-print Submitted to the DMAH Special Issue of the Springer DAPD
Journa
Efficiently identifying disguised nulls in heterogeneous text data
Informal publication onlyInternational audienceDigital data is produced in many data models, ranging from highly structured (typically relational) to semi-structured models (XML, JSON) to various graph formats (RDF, property graphs) or text. Most real-world datasets contain a certain amount of null values, denoting missing, unknown or unapplicable information. While some data models allow representing nulls by special tokens, socalled disguised nulls are also frequently encountered: these are values that are not syntactically speaking nulls, but which do, nevertheless, denote the absence, unavailability or unapplicability of the information. This paper describes our ongoing work toward detecting disguised nulls in textual data, encountered in ConnectionLens graphs. Driven by journalistic applications, we focus for now on large, semistructured datasets, where most or all data values are freeform text. We show that the state-of-the-art methods for detecting nulls in relational databases, mostly tailored towards numerical data, do not detect disguised nulls efficiently on such data. Then, we present two alternative methods: (i) leveraging Information Extraction, and (ii) text embeddings and classification. We detail their performance-precision trade-offs on real-world datasets
An Extensible Framework for Data Cleaning
Projet CARAVELData integration solutions dealing with large amounts of data have been strongly required in the last few years. Besides the traditional data integration problems (e.g. schema integration, local to global schema mappings), three additional data problems have to be dealt with: (1) the absence of universal keys across different databases that is known as the object identity problem, (2) the existence of keyborad errors in the data, and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is globally called the data cleaning process. In this work, we propose a framework which offers the fundamental services required by this process: data transformation, duplicate elimination and multi-table matching. These services are implemented using a set of purposely designed macro-operators. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is the ability of explicitly including the human interaction in the process. The main novelty of the work is that the framework permits the following performance optimizations which are tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down and short-circuited computation. We measure the benefits of each
Declarative Data Cleaning : Language, Model, and Algorithms
Projet CARAVELThe problem of data cleaning, which consists of emoving inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured one, or the integration of heterogeneous scientific data sets in inter-discipl- inary fields (e.g., in environmental science), existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge with them is the design of a data flow graph that effectively generates clean data, and can perform efficiently on large sets of input data. The difficulty with them comes from (i) a lack of clear separation between the logical specification of data transformations and their physical implementation and (ii) the lack of explanation of cleaning results and user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessement of the proposed framework for data cleaning
Performance Analysis of One-to-Many Data Transformations
Relational Database Systems often support activities like data warehousing, cleaning and integration. All these activities require performing some sort of data transformations. Since data often resides on relational databases, data transformations are often specified using SQL, which is based of relational algebra. However, many useful data transformations cannot be expressed as SQL queries due to limited expressive power of relational algebra. In particular, an important class of data transformations that produces several output tuples for a single input tuple cannot be expressed in that way. In this report, we analyze alternatives to process one-to-many data transformations using Relational Database Systems, and compare them in terms of expressiveness, optimizability and performanc
On Multiple Semantics for Declarative Database Repairs
We study the problem of database repairs through a rule-based framework that
we refer to as Delta Rules. Delta Rules are highly expressive and allow
specifying complex, cross-relations repair logic associated with Denial
Constraints, Causal Rules, and allowing to capture Database Triggers of
interest. We show that there are no one-size-fits-all semantics for repairs in
this inclusive setting, and we consequently introduce multiple alternative
semantics, presenting the case for using each of them. We then study the
relationships between the semantics in terms of their output and the complexity
of computation. Our results formally establish the tradeoff between the
permissiveness of the semantics and its computational complexity. We
demonstrate the usefulness of the framework in capturing multiple data repair
scenarios for an Academic Search database and the TPC-H databases, showing how
using different semantics affects the repair in terms of size and runtime, and
examining the relationships between the repairs. We also compare our approach
with SQL triggers and a state-of-the-art data repair system