12,855 research outputs found
Information Integration - the process of integration, evolution and versioning
At present, many information sources are available wherever you are. Most of the time, the information needed is spread across several of those information sources. Gathering this information is a tedious and time consuming job. Automating this process would assist the user in its task. Integration of the information sources provides a global information source with all information needed present. All of these information sources also change over time. With each change of the information source, the schema of this source can be changed as well. The data contained in the information source, however, cannot be changed every time, due to the huge amount of data that would have to be converted in order to conform to the most recent schema.\ud
In this report we describe the current methods to information integration, evolution and versioning. We distinguish between integration of schemas and integration of the actual data. We also show some key issues when integrating XML data sources
Integration of Biological Sources: Exploring the Case of Protein Homology
Data integration is a key issue in the domain of bioin- formatics, which deals with huge amounts of heteroge- neous biological data that grows and changes rapidly. This paper serves as an introduction in the field of bioinformatics and the biological concepts it deals with, and an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioin- formatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Un- certain databases are able to contain several possi- ble worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration
Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: the fact that different data sources have data on the same real world entities. Much development time in data integration projects is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates from the integration result or solve other semantic conflicts, but it proofs impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database enabling it to already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. This proves that our approach indeed reduces development effort — and not merely shifts the effort to rule definition and threshold tuning — by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ integration that can be meaningfully used
From Questions to Effective Answers: On the Utility of Knowledge-Driven Querying Systems for Life Sciences Data
We compare two distinct approaches for querying data in the context of the
life sciences. The first approach utilizes conventional databases to store the
data and intuitive form-based interfaces to facilitate easy querying of the
data. These interfaces could be seen as implementing a set of "pre-canned"
queries commonly used by the life science researchers that we study. The second
approach is based on semantic Web technologies and is knowledge (model) driven.
It utilizes a large OWL ontology and same datasets as before but associated as
RDF instances of the ontology concepts. An intuitive interface is provided that
allows the formulation of RDF triples-based queries. Both these approaches are
being used in parallel by a team of cell biologists in their daily research
activities, with the objective of gradually replacing the conventional approach
with the knowledge-driven one. This provides us with a valuable opportunity to
compare and qualitatively evaluate the two approaches. We describe several
benefits of the knowledge-driven approach in comparison to the traditional way
of accessing data, and highlight a few limitations as well. We believe that our
analysis not only explicitly highlights the specific benefits and limitations
of semantic Web technologies in our context but also contributes toward
effective ways of translating a question in a researcher's mind into precise
computational queries with the intent of obtaining effective answers from the
data. While researchers often assume the benefits of semantic Web technologies,
we explicitly illustrate these in practice
Subgraph Pattern Matching over Uncertain Graphs with Identity Linkage Uncertainty
There is a growing need for methods which can capture uncertainties and
answer queries over graph-structured data. Two common types of uncertainty are
uncertainty over the attribute values of nodes and uncertainty over the
existence of edges. In this paper, we combine those with identity uncertainty.
Identity uncertainty represents uncertainty over the mapping from objects
mentioned in the data, or references, to the underlying real-world entities. We
propose the notion of a probabilistic entity graph (PEG), a probabilistic graph
model that defines a distribution over possible graphs at the entity level. The
model takes into account node attribute uncertainty, edge existence
uncertainty, and identity uncertainty, and thus enables us to systematically
reason about all three types of uncertainties in a uniform manner. We introduce
a general framework for constructing a PEG given uncertain data at the
reference level and develop highly efficient algorithms to answer subgraph
pattern matching queries in this setting. Our algorithms are based on two novel
ideas: context-aware path indexing and reduction by join-candidates, which
drastically reduce the query search space. A comprehensive experimental
evaluation shows that our approach outperforms baseline implementations by
orders of magnitude
Using Fuzzy Linguistic Representations to Provide Explanatory Semantics for Data Warehouses
A data warehouse integrates large amounts of extracted and summarized data from multiple sources for direct querying and analysis. While it provides decision makers with easy access to such historical and aggregate data, the real meaning of the data has been ignored. For example, "whether a total sales amount 1,000 items indicates a good or bad sales performance" is still unclear. From the decision makers' point of view, the semantics rather than raw numbers which convey the meaning of the data is very important. In this paper, we explore the use of fuzzy technology to provide this semantics for the summarizations and aggregates developed in data warehousing systems. A three layered data warehouse semantic model, consisting of quantitative (numerical) summarization, qualitative (categorical) summarization, and quantifier summarization, is proposed for capturing and explicating the semantics of warehoused data. Based on the model, several algebraic operators are defined. We also extend the SQL language to allow for flexible queries against such enhanced data warehouses
Dura
The reactive event processing language, that is developed in the context of this project, has been called DEAL in previous documents. When we chose this name for our language it has not been used by other authors working in the same research area (complex event processing). However, in the meantime it appears in publications of other authors and because we have not used the name in publications yet we cannot claim that we were the first to use it. In order to avoid ambiguities and name conflicts in future publications we decided to rename our language to Dura which stands for “Declarative uniform reactive event processing language”. Therefore the title of this deliverable has been updated to “Dura – Concepts and Examples”
- …