15 research outputs found
SemLinker: automating big data integration for casual users
A data integration approach combines data from different sources and builds a unified view for users. Big data integration is inherently a complex task, and existing approaches are either limited in scope or rely on manual input and intervention from experts or skilled users. SemLinker, an ontology-based data integration system, is part of a metadata management framework for the personal data lake (PDL), a personal store-everything architecture. Because PDL targets casual and unskilled users, SemLinker adopts an automated data integration workflow to minimize manual input requirements. To support the flat architecture of a lake, SemLinker builds and maintains a schema metadata level without physically transforming data during integration, preserving the data in their native formats while, at the same time, allowing them to be queried and analyzed. SemLinker addresses the big data integration challenges of scalability, heterogeneity, and schema evolution. It is evaluated on large, real-world datasets of substantial heterogeneity. The results confirm SemLinker's integration efficiency and robustness, especially its capability to automatically handle data heterogeneities and schema evolutions.
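The idea of a schema metadata level that integrates sources without physically transforming them can be illustrated with a small sketch. This is not SemLinker's implementation; the source names, field names, and mapping structure below are illustrative assumptions.

```python
# Hypothetical sketch of a schema-metadata layer in the spirit of SemLinker:
# sources keep records in their native form, and integration happens only
# through attribute mappings maintained as metadata.

# Per-source mapping from native field names to global ontology attributes
# (all names here are assumed for illustration).
SCHEMA_METADATA = {
    "weather_csv": {"temp_c": "temperature", "town": "location"},
    "sensor_json": {"temperature_celsius": "temperature", "site": "location"},
}

# The underlying data stays in each source's native shape; nothing is copied
# into a unified physical store.
SOURCES = {
    "weather_csv": [{"temp_c": 21.5, "town": "Cardiff"}],
    "sensor_json": [{"temperature_celsius": 19.0, "site": "Swansea"}],
}

def query(global_attr):
    """Answer a query over a global attribute without transforming the data."""
    results = []
    for source, mapping in SCHEMA_METADATA.items():
        # Find which native field in this source carries the global attribute.
        for native_field, attr in mapping.items():
            if attr == global_attr:
                for record in SOURCES[source]:
                    results.append((source, record[native_field]))
    return results
```

A query such as `query("temperature")` then returns values from both sources even though their native field names differ; handling schema evolution amounts to updating `SCHEMA_METADATA`, not the stored data.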
Query Translation in a Database Sharing Peer to Peer Network
In a peer-to-peer database sharing network, users query data from all peers with a single query, as if they were querying one database. Implementing such a facility requires solutions to the problems of schema conflicts and query translation. Query translation is the problem of rewriting a query posed in terms of one schema into a query in terms of another schema. Schema conflicts are the problems that arise when integrating data from databases that were designed independently. This paper proposes an architecture for integrating and querying databases in a peer-to-peer (P2P) network.
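Query translation as described here can be sketched minimally: a query posed against one peer's schema is rewritten into another peer's terms using a declared attribute correspondence. The schema names, attribute pairs, and query representation below are assumptions for illustration, not the paper's design.

```python
# A minimal, hypothetical illustration of query translation between two
# peer schemas: correspondences between independently designed schemas
# (the root of the schema conflicts described above) drive the rewrite.

# (table, column) in peer A's schema -> (table, column) in peer B's schema
PEER_MAPPING = {
    ("Employee", "name"):   ("Staff", "full_name"),
    ("Employee", "salary"): ("Staff", "pay"),
}

def translate(table, columns, condition):
    """Rewrite a simple select query on peer A's schema into peer B's terms."""
    new_table = None
    new_columns = []
    for col in columns:
        t, c = PEER_MAPPING[(table, col)]
        new_table = t
        new_columns.append(c)
    # Rewrite column references appearing in the selection condition.
    new_condition = condition
    for (t, c), (_, nc) in PEER_MAPPING.items():
        if t == table:
            new_condition = new_condition.replace(c, nc)
    return new_table, new_columns, new_condition
```

For example, a query equivalent to `SELECT name FROM Employee WHERE salary > 50000` is rewritten into `SELECT full_name FROM Staff WHERE pay > 50000`. A real translator would parse the query and resolve structural conflicts (split attributes, unit mismatches), not just rename.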
Enabling Cross Constraint Satisfaction in RDF-Based Heterogeneous Database Integration
Abstract The problem of database integration has been widely tackled through different approaches. While data transformation based systems, such as Data Warehouses, gained industry acceptance during the 1980s, in the last decade query translation based approaches have grown in popularity given their suitability for dynamic domains.
Automatic Table Extension with Open Data
With thousands of data sources available on the web as well as within organisations, data scientists increasingly spend more time searching for data than analysing it. To ease the task of finding and integrating relevant data for data mining projects, this dissertation presents two new methods for automatic table extension. Automatic table extension systems take over the tasks of data discovery and data integration by adding new columns with new information (new attributes) to any table. The data values in the new columns are extracted from a given corpus of tables.
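The core mechanic of table extension — adding an attribute column filled from a corpus of tables by matching on an entity column — can be sketched as follows. The corpus contents, entity names, and lookup strategy are illustrative assumptions, not the dissertation's methods.

```python
# A simplified, hypothetical sketch of automatic table extension: a query
# table is extended with a requested attribute, with values drawn from a
# corpus of (web) tables matched on an entity key column.

# Toy corpus: each corpus table maps an entity label to its attributes
# (values here are assumed, rounded figures for illustration).
CORPUS = [
    {"Berlin": {"population": 3_600_000}, "Paris": {"population": 2_100_000}},
]

def extend_table(rows, key_column, new_attribute):
    """Add `new_attribute` to each row, filled from the table corpus."""
    for row in rows:
        entity = row[key_column]
        value = None
        for table in CORPUS:
            if entity in table and new_attribute in table[entity]:
                value = table[entity][new_attribute]
                break  # take the first matching corpus value
        row[new_attribute] = value  # None if no corpus table covers the entity
    return rows
```

Real systems must additionally rank candidate corpus tables, resolve entity-name variations, and fuse conflicting values rather than taking the first match.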
Modeling tools for the integration of structured data sources
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2011. "December 2010." Cataloged from PDF version of thesis. Includes bibliographical references (p. 61-64). Disparity in representations within structured documents such as XML or SQL makes interoperability challenging, error-prone and expensive. A model is developed to process disparate representations into an encompassing generic knowledge representation. Data sources were characterized according to a number of smaller models: their case; the underlying data storage structures; a content model based on the ontological structure defined by the document's schema; and the data model, or physical structure, of the schema. To harmonize different representations and give them semantic meaning, each representation is mapped from these categories to a common dictionary. The models were implemented as a structured data analysis tool, and a basis was built for comparison across schemas and documents. Data exchange within modeling and simulation environments is increasingly in the form of XML using a variety of schemas. We therefore demonstrate the use of this modeling tool to automatically harmonize multiple disparate XML data sources in a prototype simulated environment. by Jyotsna Venkataramanan. M.Eng.
A data transformation model for relational and non-relational data
The information systems that support small, medium, and large organisations need data transformation solutions that draw on multiple data sources to fulfil the requirements of new applications and of decision-making, and so stay competitive. Relational data underpins the majority of existing application programs, whereas non-relational data underpins the majority of newly produced applications. The relational model is the more elegant of the two; nonetheless, relational databases have a drawback when it comes to managing very large volumes of data. Because they can handle massive volumes of data, non-relational databases have evolved into substitutes for relational databases. The key issue is that the rules for data transformation processes across different data types are becoming less well defined, leading to a steady decline in data quality. Therefore, to handle relational and non-relational data and satisfy data quality requirements, an empirical model in this knowledge domain is required. This study develops a data transformation model that can be used with different data sources while satisfying data quality requirements, covering in particular the transformation processes between the relational and non-relational models; the model is named Data Transformation with Two ETL Phases and Central-Library (DTTEPC). The stages and methods in the developed model transform metadata information and stored data from relational to non-relational systems, and vice versa. The model is developed and validated through expert review, and a prototype based on the final version is employed in two case studies: education and healthcare. The results of the usability test demonstrate that the developed model is capable of transforming metadata and stored data across systems, thus enhancing the information systems of various organisations through data transformation solutions. The DTTEPC model improved the integrity and completeness of the data transformation processes. Moreover, it supports decision-makers by making information from various sources and systems available to meet real-time demands.
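The two directions of transformation described above — relational tuples to non-relational documents and back — can be sketched in miniature. This is not the DTTEPC model itself; the schema, column names, and round-trip convention are assumptions for illustration.

```python
# An illustrative sketch (not DTTEPC) of a two-way transformation between
# relational rows and a non-relational, document-style representation.
# The relational schema metadata is kept separately so the round trip
# preserves column order and reveals missing attributes.

COLUMNS = ["id", "name", "email"]  # assumed relational schema metadata

def rows_to_documents(rows):
    """Relational -> non-relational: ordered tuples become keyed documents."""
    return [dict(zip(COLUMNS, row)) for row in rows]

def documents_to_rows(docs):
    """Non-relational -> relational: documents become ordered tuples.
    Attributes absent from a document come back as None, which surfaces
    completeness problems instead of silently dropping them."""
    return [tuple(doc.get(col) for col in COLUMNS) for doc in docs]
```

Keeping the schema metadata (`COLUMNS`) outside the data is what makes the transformation reversible; a production pipeline would also carry type information and constraints to protect integrity.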
Data integration support for offshore decommissioning waste management
Offshore oil and gas platforms have a design life of about 25 years whereas the techniques and tools
used for managing their data are constantly evolving. Therefore, data captured about platforms during
their lifetimes will be in varying forms. Additionally, due to the many stakeholders involved with a facility
over its life cycle, information representation of its components varies. These challenges make data
integration difficult. Over the years, data integration technology application in the oil and gas industry
has focused on meeting the needs of asset life cycle stages other than decommissioning. This is the
case because most assets are just reaching the end of their design lives.
Currently, limited work has
been done on integrating life cycle data for offshore decommissioning purposes, and reports by industry
stakeholders underscore this need.
This thesis proposes a method for the integration of the common data types relevant in oil and gas
decommissioning. The key features of the method are that it (i) ensures semantic homogeneity using
knowledge representation languages (Semantic Web) and domain specific reference data (ISO 15926);
and (ii) allows stakeholders to continue to use their current applications. Prototypes of the framework
have been implemented using open source software applications and performance measures made.
The work of this thesis has been motivated by the business case of reusing offshore decommissioning
waste items. The framework developed is generic and can be applied whenever there is a need to
integrate and query disparate data involving oil and gas assets. The prototypes presented show how
the data management challenges associated with assessing the suitability of decommissioned offshore
facility items for reuse can be addressed. The performance of the prototypes shows that significant time
and effort are saved compared to the state-of-the-art solution. The ability to do this effectively and
efficiently during decommissioning will advance the oil and gas industry's transition toward a
circular economy and help save costs.
A semantic and agent-based approach to support information retrieval, interoperability and multi-lateral viewpoints for heterogeneous environmental databases
PhD thesis. Data stored in individual autonomous databases often needs to be combined and
interrelated. For example, in the Inland Water (IW) environment monitoring domain,
the spatial and temporal variation of measurements of different water quality indicators
stored in different databases is of interest. Data from multiple data sources is more
complex to combine when there is a lack of metadata in a computational form and when
the syntax and semantics of the stored data models are heterogeneous. The main types
of information retrieval (IR) requirements are query transparency and data
harmonisation for data interoperability and support for multiple user views. A
combined Semantic Web based and Agent based distributed system framework has
been developed to support the above IR requirements. It has been implemented using
the Jena ontology and JADE agent toolkits. The semantic part supports the
interoperability of autonomous data sources by merging their intensional data, using a
Global-As-View or GAV approach, into a global semantic model, represented in
DAML+OIL and in OWL. This is used to mediate between different local database
views. The agent part provides the semantic services to import, align and parse
semantic metadata instances, to support data mediation and to reason about data
mappings during alignment. The framework has been applied to support information
retrieval, interoperability and multi-lateral viewpoints for four European environmental
agency databases.
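The Global-As-View (GAV) approach described above — expressing a global model directly in terms of the local sources — can be illustrated with a toy sketch. The agency data, field names, and global relation below are assumptions, not the thesis's actual databases or DAML+OIL/OWL model.

```python
# A hypothetical Global-As-View (GAV) sketch: each global relation is
# defined as a view (here, a function) over the local sources, so a query
# against the global model is answered by evaluating the view definition.

# Two autonomous sources with heterogeneous local schemas (assumed names).
AGENCY_A = [{"station": "S1", "no3_mg_l": 4.2}]
AGENCY_B = [{"site_id": "S2", "nitrate": 3.1}]

def global_measurement():
    """GAV mapping: the global relation Measurement(site, nitrate_mg_l)
    is expressed directly in terms of the source schemas."""
    view = []
    for r in AGENCY_A:
        view.append({"site": r["station"], "nitrate_mg_l": r["no3_mg_l"]})
    for r in AGENCY_B:
        view.append({"site": r["site_id"], "nitrate_mg_l": r["nitrate"]})
    return view
```

The GAV trade-off is visible even at this scale: querying the global relation is trivial (just evaluate the view), but adding a new agency database means editing the view definition itself — the maintenance cost that the extended-GAV and BGLaV work in this collection aims to reduce.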
An extended GAV approach has been developed and applied to handle queries that can
be reformulated over multiple user views of the stored data. This allows users to
retrieve data in a conceptualisation that is better suited to them rather than to have to
understand the entire detailed global view conceptualisation. User viewpoints are
derived from the global ontology or existing viewpoints of it. This has the advantage
that it reduces the number of potential conceptualisations and their associated
mappings, making them more computationally manageable. Whereas an ad hoc framework
based upon conventional distributed programming language and a rule framework
could be used to support user views and adaptation to user views, a more formal
framework has the benefit in that it can support reasoning about the consistency,
equivalence, containment and conflict resolution when traversing data models. A
preliminary formulation of the formal model has been undertaken and is based upon
extending a Datalog type algebra with hierarchical, attribute and instance value
operators. These operators can be applied to support compositional mapping and
consistency checking of data views. The multiple viewpoint system was implemented
as a Java-based application consisting of two sub-systems, one for viewpoint
adaptation and management, the other for query processing and query result
adjustment.
Combining the Best of Global-as-View and Local-as-View for Data Integration
There are two main basic approaches to data integration: Global-as-View (GaV) and Local-as-View (LaV). However, both GaV and LaV have their limitations. In a GaV approach, changes in information sources, or adding a new information source, require revisions of the global schema and of the mappings between the global schema and the source schemas. In a LaV approach, automating query reformulation has exponential time complexity with respect to the query and source schema definitions. To resolve these problems, we offer BGLaV as an alternative point of view that is neither GaV nor LaV. The approach uses source-to-target mappings based on a predefined conceptual target schema, which is specified ontologically and independently of any of the sources. The proposed data integration system is easier to maintain than both GaV and LaV, and query reformulation reduces to rule unfolding. Compared with other data integration approaches, our approach combines the advantages of GaV and LaV, mitigates the disadvantages, and provides an alternative for flexible and scalable data integration.
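Query reformulation by rule unfolding, as mentioned above, can be sketched minimally: each target predicate carries rules that rewrite it into source-level queries, and answering a target query is a matter of substituting those rules. The predicate names and source identifiers below are assumptions for illustration, not BGLaV's actual mappings.

```python
# An illustrative sketch of query reformulation by rule unfolding: each
# target predicate maps to the source queries that populate it, and a
# conjunctive target query unfolds into the cross-product of its
# predicates' rewritings (all names here are assumed).

MAPPING_RULES = {
    "Person": ["src1.customers", "src2.clients"],
    "Order":  ["src1.orders"],
}

def unfold(target_query):
    """Unfold a conjunctive target query (a list of target predicates)
    into the alternative source-level query plans."""
    plans = [[]]
    for predicate in target_query:
        # Each existing partial plan branches once per applicable rule.
        plans = [plan + [src]
                 for plan in plans
                 for src in MAPPING_RULES[predicate]]
    return plans
```

Because the mappings run from sources to a fixed conceptual target schema, this unfolding is a direct substitution — in contrast to LaV, where reformulation requires answering queries using views and can blow up exponentially.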