A semantic and agent-based approach to support information retrieval, interoperability and multi-lateral viewpoints for heterogeneous environmental databases
PhD
Data stored in individual autonomous databases often needs to be combined and
interrelated. For example, in the Inland Water (IW) environment monitoring domain,
the spatial and temporal variation of measurements of different water quality indicators
stored in different databases are of interest. Data from multiple data sources is more
complex to combine when there is a lack of metadata in a computational form and when
the syntax and semantics of the stored data models are heterogeneous. The main types
of information retrieval (IR) requirements are query transparency and data
harmonisation for data interoperability and support for multiple user views. A
combined Semantic Web based and Agent based distributed system framework has
been developed to support the above IR requirements. It has been implemented using
the Jena ontology and JADE agent toolkits. The semantic part supports the
interoperability of autonomous data sources by merging their intensional data, using a
Global-As-View or GAV approach, into a global semantic model, represented in
DAML+OIL and in OWL. This is used to mediate between different local database
views. The agent part provides the semantic services to import, align and parse
semantic metadata instances, to support data mediation and to reason about data
mappings during alignment. The framework has been applied to support information
retrieval, interoperability and multi-lateral viewpoints for four European environmental
agency databases.
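The Global-As-View idea described above can be illustrated with a minimal Python sketch. It is not the thesis's Jena/JADE implementation: the source schemas, field names, unit conversion and the `query_global` helper are all hypothetical, chosen only to show how a global relation is defined as views over heterogeneous local sources and how a query over the global model is answered from them.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

# Hypothetical local sources: two agency databases that store the same kind
# of measurement under different field names and units.
SOURCE_A: List[Row] = [{"station": "A1", "date": "2003-05-01", "no3_mg_l": 2.5}]
SOURCE_B: List[Row] = [{"site_id": "B7", "sampled": "2003-05-01", "nitrate_ug_l": 1800.0}]


def view_from_a() -> Iterable[Row]:
    """Express source A's rows in the (assumed) global vocabulary."""
    for r in SOURCE_A:
        yield {"site": r["station"], "date": r["date"],
               "indicator": "nitrate", "value_mg_l": r["no3_mg_l"]}


def view_from_b() -> Iterable[Row]:
    """Express source B's rows globally, harmonising micrograms to milligrams."""
    for r in SOURCE_B:
        yield {"site": r["site_id"], "date": r["sampled"],
               "indicator": "nitrate", "value_mg_l": r["nitrate_ug_l"] / 1000.0}


# GAV: every global relation is defined as a union of views over the sources.
GLOBAL_SCHEMA: Dict[str, List[Callable[[], Iterable[Row]]]] = {
    "measurement": [view_from_a, view_from_b],
}


def query_global(relation: str, predicate: Callable[[Row], bool]) -> List[Row]:
    """Answer a query on the global relation by unfolding it into its source views."""
    return [row for view in GLOBAL_SCHEMA[relation] for row in view() if predicate(row)]


results = query_global(
    "measurement",
    lambda m: m["indicator"] == "nitrate" and m["value_mg_l"] > 2.0,
)
print(results)
```

The user only writes the predicate against the global vocabulary; which sources hold the data, and in which units, stays hidden behind the view definitions.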
An extended GAV approach has been developed and applied to handle queries that can
be reformulated over multiple user views of the stored data. This allows users to
retrieve data in a conceptualisation that is better suited to them rather than to have to
understand the entire detailed global view conceptualisation. User viewpoints are
derived from the global ontology or existing viewpoints of it. This has the advantage
that it reduces the number of potential conceptualisations and their associated
mappings to be more computationally manageable. Whereas an ad hoc framework
based upon conventional distributed programming language and a rule framework
could be used to support user views and adaptation to user views, a more formal
framework has the benefit that it can support reasoning about consistency,
equivalence, containment and conflict resolution when traversing data models. A
preliminary formulation of the formal model has been undertaken and is based upon
extending a Datalog type algebra with hierarchical, attribute and instance value
operators. These operators can be applied to support compositional mapping and
consistency checking of data views. The multiple viewpoint system was implemented
as a Java-based application consisting of two sub-systems, one for viewpoint
adaptation and management, the other for query processing and query result
adjustment.
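As a rough illustration of viewpoints derived from the global ontology or from other viewpoints, the sketch below composes term mappings along the derivation chain and runs a simple consistency check. It is a plain-Python stand-in, not the Datalog-type algebra of the thesis; the viewpoint names, terms and functions are invented for the example.

```python
from typing import Dict, List

# Hypothetical global model terms.
GLOBAL_TERMS = {"Nitrate", "Phosphate", "DissolvedOxygen"}

# Each viewpoint maps its own terms onto terms of its parent, which is either
# another viewpoint or the global model itself.
VIEWPOINTS: Dict[str, dict] = {
    "nutrients": {"parent": "global",
                  "mapping": {"N": "Nitrate", "P": "Phosphate"}},
    "farm_runoff": {"parent": "nutrients",
                    "mapping": {"Fertiliser nitrogen": "N"}},
}


def resolve(viewpoint: str, term: str) -> str:
    """Compose mappings upwards until the term is expressed in the global model."""
    while viewpoint != "global":
        definition = VIEWPOINTS[viewpoint]
        term = definition["mapping"][term]
        viewpoint = definition["parent"]
    return term


def inconsistent_terms(viewpoint: str) -> List[str]:
    """Return the viewpoint's terms that do not resolve to a known global term."""
    bad = []
    for term in VIEWPOINTS[viewpoint]["mapping"]:
        try:
            if resolve(viewpoint, term) not in GLOBAL_TERMS:
                bad.append(term)
        except KeyError:  # a link in the mapping chain is missing
            bad.append(term)
    return bad


print(resolve("farm_runoff", "Fertiliser nitrogen"))
print(inconsistent_terms("farm_runoff"))
```

Because a derived viewpoint only stores its mapping to its parent, adding a viewpoint does not require a new mapping to every local database.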
Efficient similarity-based operations for data integration
Similarity-based operations, similarity join, similarity grouping, data integration. Magdeburg, Univ., Fak. für Informatik, Diss., 2004, by Eike Schalleh
BioWarehouse: a bioinformatics database warehouse toolkit
BACKGROUND: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. RESULTS: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. 
CONCLUSION: BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
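The kind of multi-database SQL query mentioned in the abstract can be sketched as follows. The example uses Python's built-in `sqlite3` module rather than MySQL or Oracle, and the two tables, their columns and the toy rows are assumptions made for illustration; they are not the BioWarehouse schema, and the resulting fraction has no relation to the 36% reported above.

```python
import sqlite3

# In-memory database standing in for a warehouse that already holds data
# loaded from an enzyme database and a sequence database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY);
CREATE TABLE protein_sequence (protein_id TEXT PRIMARY KEY, ec_number TEXT);

INSERT INTO enzyme_activity VALUES ('1.1.1.1'), ('2.7.1.1'), ('3.4.21.4'), ('4.2.1.1');
INSERT INTO protein_sequence VALUES
    ('P1', '1.1.1.1'), ('P2', '1.1.1.1'), ('P3', '2.7.1.1'), ('P4', '4.2.1.1');
""")


def fraction_without_sequence(connection: sqlite3.Connection) -> float:
    """Fraction of EC numbers that have no associated sequence."""
    total, missing = connection.execute("""
        SELECT COUNT(DISTINCT e.ec_number),
               COUNT(DISTINCT CASE WHEN p.protein_id IS NULL THEN e.ec_number END)
        FROM enzyme_activity AS e
        LEFT JOIN protein_sequence AS p ON p.ec_number = e.ec_number
    """).fetchone()
    return missing / total


fraction = fraction_without_sequence(conn)
print(f"{fraction:.0%} of toy EC numbers have no sequence")
```

Once both sources sit in one relational schema, the question reduces to a single outer join, which is the practical argument for the warehousing approach.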
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions.
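To make the indexing and matching steps of an ER workflow concrete, here is a minimal Python sketch. It uses token blocking followed by Jaccard-similarity matching, which is one simple combination picked for illustration and not a method the survey singles out; the entity descriptions, function names and the 0.5 threshold are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Invented entity descriptions; d1 and d2 describe the same product.
descriptions: Dict[str, str] = {
    "d1": "Apple iPhone 12 64GB black",
    "d2": "iphone 12 black 64 gb apple",
    "d3": "Samsung Galaxy S21 black",
    "d4": "Dell XPS 13 laptop",
}


def tokens(text: str) -> Set[str]:
    """Lower-cased whitespace tokens of a description."""
    return set(text.lower().split())


def token_blocking(descs: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Indexing step: only descriptions sharing a token become candidate pairs."""
    blocks = defaultdict(set)
    for identifier, text in descs.items():
        for token in tokens(text):
            blocks[token].add(identifier)
    pairs: Set[Tuple[str, str]] = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def resolve_entities(descs: Dict[str, str], threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Matching step: keep candidate pairs whose token sets are similar enough."""
    return sorted(
        (left, right)
        for left, right in token_blocking(descs)
        if jaccard(tokens(descs[left]), tokens(descs[right])) >= threshold
    )


candidates = token_blocking(descriptions)
matches = resolve_entities(descriptions)
print(len(candidates), "candidate pairs,", matches)
```

Blocking compares three of the six possible pairs here; on large collections that reduction is what makes the matching step affordable.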
The POESIA approach to the integration of data and services on the Semantic Web
Advisor: Claudia Bauzer Medeiros. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação.
Abstract: POESIA (Processes for Open-Ended Systems for Information Analysis), the approach proposed in this work, supports the construction of complex processes that involve the integration and analysis of data from several sources, particularly in scientific applications. This approach is centered on two types of Semantic Web mechanisms: scientific workflows, to specify and compose Web services; and domain ontologies, to enable semantic interoperability and management of data and processes. The main contributions of this thesis are: (i) a theoretical framework to describe, discover and compose data and services on the Web, including rules to check the semantic consistency of resource compositions; (ii) ontology-based methods to help data integration and estimate data provenance in cooperative processes on the Web; (iii) partial implementation and validation of the proposal, in a real application for the domain of agricultural planning, analyzing the benefits and scalability problems of current Semantic Web technology when faced with large volumes of data.
Doctorate in Computer Science.
A conceptual framework and a risk management approach for interoperability between geospatial datacubes
Today, we observe wide use of geospatial databases that are implemented in many forms (e.g., transactional centralized systems, distributed databases, multidimensional datacubes). Among those possibilities, the multidimensional datacube is more appropriate to support interactive analysis and to guide the organization’s strategic decisions, especially when different epochs and levels of information granularity are involved. However, one may need to use several geospatial multidimensional datacubes, which may be semantically heterogeneous and have different degrees of appropriateness to the context of use. Overcoming the semantic problems related to semantic heterogeneity and to differences in appropriateness to the context of use, in a manner that is transparent to users, has been the principal aim of interoperability for the last fifteen years. However, in spite of successful initiatives, today's solutions have evolved in a non-systematic way. Moreover, no solution has been found to address specific semantic problems related to interoperability between geospatial datacubes. In this thesis, we suppose that it is possible to define an approach that addresses these semantic problems to support interoperability between geospatial datacubes. To that end, we first describe interoperability between geospatial datacubes. Then, we define and categorize the semantic heterogeneity problems that may occur during the interoperability process of different geospatial datacubes.
In order to resolve semantic heterogeneity between geospatial datacubes, we propose a conceptual framework that is essentially based on human communication. In this framework, software agents representing the geospatial datacubes involved in the interoperability process communicate with each other. Such communication aims at exchanging information about the content of the geospatial datacubes. Then, in order to help agents make appropriate decisions during the interoperability process, we evaluate a set of indicators of the external quality (fitness-for-use) of geospatial datacube schemas and of the production context (e.g., metadata). Finally, we implement the proposed approach to show its feasibility.