5 research outputs found

    An automated ETL for online datasets

    While using online datasets for machine learning is commonplace today, the quality of these datasets impacts the performance of prediction algorithms. One method for improving the semantics of new data sources is to map these sources to a common data model or ontology. While semantic and structural heterogeneities must still be resolved, this provides a well-established approach to producing clean datasets suitable for machine learning and analysis. However, when online data must be used in close to real time, a method for dynamic Extract-Transform-Load (ETL) of new source data is required. In this work, we present a framework for integrating online and enterprise data sources, in close to real time, to provide datasets for machine learning and predictive algorithms. An exhaustive evaluation compares a human-built data transformation process with our system's machine-generated ETL process, with very favourable results, illustrating the value and impact of an automated approach.
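
    The abstract does not detail the transformation mechanism, but the core idea of mapping a new online source onto a common data model can be illustrated with a minimal sketch. All field names, mapping rules and sample data below are hypothetical, not taken from the paper.

    from typing import Callable

    # Common data model: target field -> (source field, transform function).
    MAPPING: dict[str, tuple[str, Callable[[str], object]]] = {
        "price_eur": ("Price", lambda v: float(v.replace(",", ""))),
        "region": ("Region Name", str.strip),
        "date": ("ReportDate", lambda v: v[:10]),  # keep the ISO date part
    }

    def transform(row: dict[str, str]) -> dict[str, object]:
        """Map one extracted source row onto the common data model."""
        return {target: fn(row[source]) for target, (source, fn) in MAPPING.items()}

    def etl(extracted: list[dict[str, str]]) -> list[dict[str, object]]:
        """Transform all extracted rows; the load step is left abstract."""
        return [transform(r) for r in extracted]

    print(etl([{"Price": "1,250.5", "Region Name": " Munster ", "ReportDate": "2020-03-01T00:00"}]))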

    Centric Model Assessment for Collaborative Data Mining

    Data mining is the task of discovering interesting patterns from large amounts of data. There are many data mining tasks, such as classification, clustering, association rule mining and sequential pattern mining. Sequential pattern mining finds sets of data items that occur together frequently in some sequences. Collaborative data mining refers to a data mining setting where different groups are geographically dispersed but work together on the same problem in a collaborative way. Such a setting requires adequate software support. Group work is widespread in education. The goal is to enable the groups and their facilitators to see relevant aspects of the groups' operation, and to provide feedback on whether these are more likely to be associated with positive or negative outcomes and where the problems lie. We explore how useful mirror information can be extracted via a theory-driven approach and a range of clustering and sequential pattern mining techniques. In this paper we describe an experiment with a simple implementation of such a collaborative data mining environment. The experiment brings to light several problems, one of which is related to model assessment. We discuss several possible solutions. This discussion can contribute to a better understanding of how collaborative data mining is best organized.
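
    As an illustration of the sequential pattern mining mentioned above, the following sketch naively counts two-step patterns that recur across per-group event sequences. The event names and support threshold are invented; the paper's actual mining method is not shown here.

    from collections import Counter
    from itertools import combinations

    sequences = [
        ["post", "reply", "edit", "commit"],
        ["post", "edit", "commit"],
        ["reply", "post", "commit"],
    ]

    MIN_SUPPORT = 2  # a pattern must occur in at least this many sequences

    counts: Counter = Counter()
    for seq in sequences:
        # Each ordered pair (a before b) is counted at most once per sequence.
        pairs = {(a, b) for a, b in combinations(seq, 2)}
        counts.update(pairs)

    frequent = {p: c for p, c in counts.items() if c >= MIN_SUPPORT}
    print(frequent)  # e.g. {('post', 'commit'): 3, ('post', 'edit'): 2, ...}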

    An Integrative and Uniform Model for Metadata Management in Data Warehousing Environment

    Due to the increasing complexity of data warehouses, a centralized and declarative management of metadata is essential for data warehouse administration, maintenance and usage. Metadata are usually divided into technical and semantic metadata. Typically, current approaches only support subsets of these metadata types, such as data movement metadata or multidimensional metadata for OLAP. In particular, the interdependencies between technical and semantic metadata have not yet been investigated sufficiently. The representation of these interdependencies forms an important prerequisite for the translation of queries formulated at the business concept level into executable queries on physical data. Therefore, we suggest a uniform and integrative model for data warehouse metadata. This model uses a uniform representation approach based on the Unified Modeling Language (UML) to integrate technical and semantic metadata and their interdependencies.
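
    A minimal sketch of the kind of interdependency the model captures, assuming hypothetical names throughout: semantic metadata (business concepts) is linked to technical metadata (physical tables and columns), so that a business-level request can be rewritten as an executable physical query. This is not the paper's UML model, only an illustration of the idea.

    from dataclasses import dataclass

    @dataclass
    class TechnicalMetadata:
        table: str
        column: str

    @dataclass
    class SemanticMetadata:
        concept: str                  # business-level name
        maps_to: TechnicalMetadata    # interdependency between the two layers

    catalog = {
        "Revenue": SemanticMetadata("Revenue", TechnicalMetadata("fact_sales", "amount")),
        "Region": SemanticMetadata("Region", TechnicalMetadata("dim_region", "name")),
    }

    def to_physical_query(measure: str, dimension: str) -> str:
        """Translate a business-concept query into SQL over physical tables."""
        m, d = catalog[measure].maps_to, catalog[dimension].maps_to
        return (f"SELECT {d.table}.{d.column}, SUM({m.table}.{m.column}) "
                f"FROM {m.table} JOIN {d.table} USING (region_id) "
                f"GROUP BY {d.table}.{d.column}")

    print(to_physical_query("Revenue", "Region"))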

    Plataforma de Integração na Área da Saúde (Integration Platform in the Healthcare Area)

    This dissertation was developed in a business context at Deloitte Portugal, a company that focuses on providing consultancy services to third parties and beyond; it was there that the opportunity to develop this project arose. Globalization and technological development have changed the reality of many organizations. There is now a growing need for information that is available online and kept up to date, as this ensures that decisions are made with greater confidence. With the emergence of new Information Technology (IT) solutions to meet market demands, system integration has become essential to optimize processes, centralize data and improve the user experience. The problem described in this dissertation is the absence of an intermediate layer capable of handling communications between heterogeneous systems and managing those communications. The objective of this dissertation is the study and implementation of an integration framework through which the already developed hospital management software, ePatient, can communicate with the various services a hospital has implemented. Part of the development of this framework involves handling errors, redirecting requests and monitoring Application Programming Interfaces (APIs), for which integration software is used.
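
    The essential role of such an intermediate layer can be sketched as follows: route an incoming request to the appropriate backend service, handle failures, and log basic monitoring data. The service registry and URLs are invented for illustration; the dissertation's actual integration software is not shown.

    import logging
    import urllib.request
    from typing import Optional
    from urllib.error import URLError

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("gateway")

    # Hypothetical registry mapping logical service names to backend base URLs.
    SERVICES = {
        "lab": "http://lab-service.local/api",
        "imaging": "http://imaging-service.local/api",
    }

    def route(service: str, path: str) -> Optional[bytes]:
        """Redirect a request to the named service; log the outcome for monitoring."""
        base = SERVICES.get(service)
        if base is None:
            log.error("unknown service: %s", service)
            return None
        url = f"{base}/{path}"
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                log.info("%s %s -> %s", service, path, resp.status)
                return resp.read()
        except URLError as exc:  # error handling: backend unreachable
            log.error("%s %s failed: %s", service, path, exc)
            return None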

    Reusing dynamic data marts for query management in an on-demand ETL architecture

    Data analysts often have a requirement to integrate an in-house data warehouse with external datasets, especially web-based datasets. Doing so can give them important insights into their performance when compared with competitors and their industry in general on a global scale, and enable sales predictions, providing important decision support services. The quality of these insights depends on the quality of the data imported into the analysis dataset. There is a wealth of data freely available from government sources online but little unity between data sources, leading to a requirement for a data processing layer in which various types of quality issues and heterogeneities can be resolved. Traditionally, this is achieved with an Extract-Transform-Load (ETL) series of processes which are performed on all of the available data, in advance, in a batch process typically run outside of business hours. While this is recognized as a powerful knowledge-based support, it is very expensive to build and maintain, and very costly to update in the event that new data sources become available. On-demand ETL offers a solution in that data is only acquired when needed and new sources can be added as they come online. However, this form of dynamic ETL is very difficult to deliver. In this research dissertation, we explore the possibilities of creating dynamic data marts using non-warehouse data to support the inclusion of new sources. We then examine how these dynamic structures can be used for query fulfillment and how they can support an overall on-demand query mechanism. At each step of the research and development, we employ a robust validation using a real-world data warehouse from the agricultural domain with selected Agri web sources to test the dynamic elements of the proposed architecture.
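
    The on-demand idea can be sketched under stated assumptions: an external source is fetched and transformed into a small in-memory data mart only the first time a query needs it, then reused for later queries. Source names, the fetch step and the row schema below are hypothetical, not the dissertation's architecture.

    from typing import Callable

    marts: dict[str, list[dict]] = {}  # source name -> materialized rows

    def fetch_and_transform(source: str) -> list[dict]:
        """Stand-in for acquiring and cleaning one external web source."""
        return [{"source": source, "value": 42}]  # placeholder rows

    def query(source: str, predicate: Callable[[dict], bool]) -> list[dict]:
        """Fulfill a query, building the dynamic data mart on first use."""
        if source not in marts:  # on-demand: acquire only when a query needs it
            marts[source] = fetch_and_transform(source)
        return [row for row in marts[source] if predicate(row)]

    print(query("agri_prices", lambda r: r["value"] > 0))  # builds the mart
    print(query("agri_prices", lambda r: True))            # reuses it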