7 research outputs found

    Graph-Based ETL Processes For Warehousing Statistical Open Data

    Warehousing is a promising means of crossing and analysing Statistical Open Data (SOD). But extracting structures, integrating, and defining multidimensional schemas from several scattered and heterogeneous tables in the SOD are major problems that challenge traditional ETL (Extract-Transform-Load) processes. In this paper, we present a three-step ETL process that relies on RDF graphs to address these problems. In the first step, we automatically extract table structures and values using a table anatomy ontology. This phase converts structurally heterogeneous tables into a unified RDF graph representation. The second step performs a holistic integration of several semantically heterogeneous RDF graphs. The optimal integration is computed by an Integer Linear Program (ILP). In the third step, the system interacts with users to incrementally transform the integrated RDF graph into a multidimensional schema.
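
    The first step can be pictured with a small sketch: a heterogeneous table is lifted into an RDF graph whose vocabulary describes the table's anatomy (table, columns, cells). The sketch below uses rdflib with a hypothetical tab: namespace; the paper's actual table anatomy ontology is not reproduced here.

        # A minimal, illustrative sketch (assumes: pip install rdflib).
        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDF

        TAB = Namespace("http://example.org/table-anatomy#")  # hypothetical vocabulary
        DATA = Namespace("http://example.org/data/")

        g = Graph()
        g.bind("tab", TAB)

        table = DATA.table1
        g.add((table, RDF.type, TAB.Table))

        header = ["Region", "Year", "Population"]
        rows = [["North", "2013", "1200"], ["South", "2013", "900"]]

        # One resource per column, labelled with the header cell.
        for j, name in enumerate(header):
            col = DATA[f"table1/col{j}"]
            g.add((col, RDF.type, TAB.Column))
            g.add((col, TAB.label, Literal(name)))
            g.add((table, TAB.hasColumn, col))

        # One resource per data cell, linked to its column.
        for i, row in enumerate(rows):
            for j, value in enumerate(row):
                cell = DATA[f"table1/cell{i}_{j}"]
                g.add((cell, RDF.type, TAB.Cell))
                g.add((cell, TAB.inColumn, DATA[f"table1/col{j}"]))
                g.add((cell, TAB.value, Literal(value)))

        print(g.serialize(format="turtle"))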

    A Linear Program For Holistic Matching: Assessment on Schema Matching Benchmark

    Schema matching is a key task in several applications such as data integration and ontology engineering. Many application fields require the matching of several schemas at once, known as "holistic matching", yet the difficulty of the problem has drawn much more attention to pairwise schema matching than to the holistic variant. In this paper, we propose a new approach for holistic matching. We model the problem with techniques borrowed from the field of combinatorial optimization. We propose a linear program, named LP4HM, which extends the maximum-weighted graph matching problem with additional linear constraints. These encompass matching setup constraints, especially cardinality and threshold constraints, and schema structural constraints, especially superclass/subclass and coherence constraints. The matching quality of LP4HM is evaluated on a recent benchmark dedicated to assessing schema matching tools. Experiments show competitive results compared to other tools, in particular for recall and the HSR quality measure.
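
    From this description, the core of such a program can be sketched as a relaxed maximum-weighted matching between two sets of schema elements a_i and b_j. The threshold τ and the encoding of coherence as pairwise exclusions are assumptions read off the abstract; the exact constraint set of LP4HM may differ:

        \begin{aligned}
        \max\quad & \sum_{i}\sum_{j} \mathrm{sim}(a_i, b_j)\, x_{ij} \\
        \text{s.t.}\quad
        & \sum_{j} x_{ij} \le 1 \;\; \forall i,
          \qquad \sum_{i} x_{ij} \le 1 \;\; \forall j
          && \text{(cardinality: at most 1:1)} \\
        & x_{ij} = 0 \;\; \text{if } \mathrm{sim}(a_i, b_j) < \tau
          && \text{(threshold)} \\
        & x_{ij} + x_{kl} \le 1 \;\; \text{for structurally incoherent pairs } (i,j), (k,l)
          && \text{(coherence)} \\
        & 0 \le x_{ij} \le 1
          && \text{(LP relaxation)}
        \end{aligned}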

    Holistic Graph Integration Based on Linear Programming for Open Data Warehousing

    In this article, we propose a holistic approach for integrating Open Data graphs. These graphs represent a hierarchical classification of the concepts extracted from Open Data. We focus on preserving strict hierarchies during integration, so that a multidimensional schema can be defined from these hierarchies and the data sources can subsequently be warehoused. Our approach is based on a linear program that automatically solves the graph matching task while globally maximizing the sum of similarities between concepts. This program consists of constraints on the matching cardinality and constraints on the structure of the graphs. To our knowledge, our approach is the first to provide a globally optimal solution for holistic graph matching within a reasonable resolution time. We also compare the quality of our results with other approaches from the literature.
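
    A minimal sketch of this matching step in Python with PuLP, assuming an illustrative similarity matrix and only the cardinality constraints (the structural constraints of the full program are omitted):

        # Illustrative only: concept names and similarities are made up.
        import pulp

        src = ["population", "year", "region"]            # concepts of graph A
        tgt = ["inhabitants", "date", "area", "sector"]   # concepts of graph B
        sim = {("population", "inhabitants"): 0.8, ("year", "date"): 0.7,
               ("region", "area"): 0.6, ("region", "sector"): 0.4}

        prob = pulp.LpProblem("holistic_matching", pulp.LpMaximize)
        x = {p: pulp.LpVariable(f"x_{p[0]}_{p[1]}", cat="Binary") for p in sim}

        # Objective: globally maximise the sum of similarities of kept pairs.
        prob += pulp.lpSum(sim[p] * x[p] for p in sim)

        # Cardinality: each concept is matched at most once.
        for s in src:
            prob += pulp.lpSum(x[p] for p in sim if p[0] == s) <= 1
        for t in tgt:
            prob += pulp.lpSum(x[p] for p in sim if p[1] == t) <= 1

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        print([p for p in sim if x[p].value() == 1])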

    A Machine Learning Classification Framework for Early Prediction of Alzheimer’s Disease

    People today, in addition to their concerns about growing old and watching themselves grow weak and wrinkly, face an increasing fear of dementia. Around 47 million people are affected by dementia worldwide, and the cost of providing them with health and social care support is estimated to reach 2 trillion dollars by 2030, almost equivalent to the 18th largest economy in the world. The most common form of dementia, and the one with the highest health and social care costs, is Alzheimer's disease, which gradually kills neurons and causes patients to lose cherished memories, the ability to recognise family members, childhood memories, and even the ability to follow simple instructions. Alzheimer's disease is irreversible, unstoppable, and has no known cure. Besides being a calamity for affected patients, it is a great financial burden on health providers. Health care providers also face a challenge in diagnosing the disease, as current methods rely on manual evaluations of a patient's medical history and mental examinations such as the Mini-Mental State Examination. These diagnostic methods often give a false diagnosis and were designed to identify Alzheimer's after stage two, when most of the symptoms are evident. The problem is that clinicians are unable to stop or control the progress of Alzheimer's disease because of a lack of knowledge of the patterns that trigger its development.

    In this thesis, we explore and investigate Alzheimer's disease from a computational perspective to uncover different risk factors, and we present a strategic framework, the Early Prediction of Alzheimer's Disease Framework (EPADf), intended to give a future prediction of early-onset Alzheimer's disease. Following extensive background research that resulted in the formalisation of the framework concept, the prediction approaches, and the concept of ranking risk factors based on clinical instinct, knowledge, and experience using mathematical reasoning, we carried out experiments to gain further insight into the disease using machine learning models. In this study, we conducted two classification experiments for early prediction of Alzheimer's disease and one ranking experiment to rank its risk factors by importance. Besides these experiments, we also present two logical approaches to search for patterns in an Alzheimer's dataset, and a ranking algorithm to rank Alzheimer's disease risk factors based on clinical evaluation. For the classification experiments we used five different machine learning models: Random Forest (RF), Random Oracle Model (ROM), a hybrid model (H2) that combines a Levenberg-Marquardt neural network and a Random Forest using Fisher discriminant analysis, Linear Neural Network (LNN), and Multi-Layer Perceptron (MLP). These models were deployed on de-identified multivariate patient data provided by ADNI (the Alzheimer's Disease Neuroimaging Initiative) to illustrate the effective use of data analysis in investigating the biological and behavioural risk factors of Alzheimer's disease. We found that the continuous enhancement of patient data and the use of combined machine learning models can provide an early, cost-effective prediction of Alzheimer's disease and help extract insightful information on the disease's risk factors. Based on this work and these findings, we developed the strategic framework (EPADf), which is discussed in more depth in this thesis.
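
    As an illustration of the classification setup, the sketch below trains two of the listed models with scikit-learn. ADNI data cannot be redistributed, so synthetic placeholder features with hypothetical names stand in for the real multivariate patient data; the ROM and H2 hybrids have no off-the-shelf scikit-learn equivalent and are not shown.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(0)
        features = ["age", "education_years", "mmse_score", "apoe4"]  # hypothetical
        X = rng.normal(size=(500, len(features)))                     # placeholder data
        y = (X[:, 2] + 0.5 * X[:, 3] + rng.normal(size=500) > 0).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

        rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                            random_state=0).fit(X_tr, y_tr)
        print("RF accuracy :", accuracy_score(y_te, rf.predict(X_te)))
        print("MLP accuracy:", accuracy_score(y_te, mlp.predict(X_te)))

        # Ranking risk factors by importance, as in the ranking experiment.
        for name, imp in sorted(zip(features, rf.feature_importances_),
                                key=lambda p: -p[1]):
            print(f"{name}: {imp:.3f}")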

    The Use of Open Public Data in Building a Data Warehouse: Constructing a Repository of Brazilian Public Educational Statistics

    Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. In the last decade, different countries have developed initiatives related to the dissemination of open government data. Despite the existence and availability of these databases, using them and extracting knowledge from them still presents challenges related to the integration and compatibility of information. This is due to both the poor structure and the great heterogeneity of the sources, which make traditional extraction, transformation, and loading (ETL) approaches less efficient. This work analyzes an approach for building an open data repository based on the structure of flat files, enabling dimensional models to be built more efficiently.
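
    The flat-file approach can be sketched as follows: a single CSV of educational statistics is split into dimension tables plus a fact table (a simple star schema). The file name and column names here are hypothetical, not the project's actual sources.

        import pandas as pd

        flat = pd.read_csv("school_census.csv")  # assumed columns: state, year, level, enrolment

        # Build one dimension table per descriptive column, with surrogate keys.
        dims = {}
        for col in ["state", "year", "level"]:
            dim = flat[[col]].drop_duplicates().reset_index(drop=True)
            dim[f"{col}_id"] = dim.index
            dims[col] = dim

        # Replace descriptive columns by foreign keys to obtain the fact table.
        fact = flat
        for col, dim in dims.items():
            fact = fact.merge(dim, on=col).drop(columns=col)
        fact = fact[[f"{c}_id" for c in dims] + ["enrolment"]]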

    European Distance and E-Learning Network (EDEN). Conference Proceedings

    The powerful combination of the information age and the consequent disruption caused by unstable environments provides the impetus to look afresh and identify new models and approaches for education (e.g. OERs, MOOCs, PLEs, Learning Analytics, etc.). For learners this has taken a fantastic leap into aggregating, curating, co-curating, and co-producing outside the boundaries of formal learning environments: the networked learner is sharing voluntarily, for free, and spontaneously with billions of people. Supported by the Erasmus+ Programme of the European Union.

    A Content-Driven ETL Processes for Open Data (ADBIS 2014)

    The emerging statistical Open Data (OD) seem very promising for generating various analysis scenarios for decision-making systems. Nevertheless, OD have problematic characteristics such as semantic and structural heterogeneity, lack of schemas, autonomy, and dispersion. These characteristics shake traditional Extract-Transform-Load (ETL) processes, since the latter generally deal with well-structured schemas. We propose in this paper a content-driven ETL process which automates "as far as possible" the extraction phase based only on the content of flat Open Data sources. Our process relies on data annotations and data mining techniques to discover hierarchical relationships. Processed data are then transformed into instance-schema graphs to facilitate structural data integration and the definition of the multidimensional schemas of the data warehouse.
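
    One simple heuristic for the hierarchy-discovery step (an assumption for illustration, not necessarily the paper's technique): treat column A as rolling up to column B whenever every value of A co-occurs with exactly one value of B, i.e. the functional dependency A → B holds in the data.

        import pandas as pd
        from itertools import permutations

        # Toy flat Open Data source with an implicit city -> region -> country hierarchy.
        df = pd.DataFrame({
            "city":    ["Lyon", "Paris", "Lille", "Nice"],
            "region":  ["ARA", "IDF", "HDF", "PACA"],
            "country": ["FR", "FR", "FR", "FR"],
        })

        for a, b in permutations(df.columns, 2):
            # A determines B if each A-value maps to a single B-value.
            if (df.groupby(a)[b].nunique() == 1).all():
                print(f"candidate hierarchy level: {a} -> {b}")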