    A Research on Data Lakes and their Integration Challenges

    With the advent of IoT and big data, we have observed a huge variety of data types (e.g. semi-structured data, conversational data, sensor data, photos, and videos) and sources (e.g. social networks, open data, webpages, and sensors). Data integration addresses the problem of reconciling data from different sources, with inconsistent schemata and formats, and possibly conflicting values. In this paper, I describe my PhD research topic: the enhancement of data integration, the discovery of new techniques capable of handling the peculiar characteristics of big data, and the study of novel frameworks and logical architectures to support the integration process.
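    As a minimal illustration of the reconciliation problem described above (not taken from the paper), the Python sketch below maps two hypothetical source records with inconsistent schemata and units onto a common schema; the field names, mapping, and conversion rule are all assumptions made for the example.

        # Illustrative only: the field names, mappings, and conflict rule are
        # hypothetical, not taken from the paper.

        # Two sources describing the same sensor with inconsistent schemata.
        source_a = {"sensor_id": "S-12", "temp_c": 21.5, "city": "Milan"}
        source_b = {"id": "S-12", "temperature": 70.7, "unit": "F", "location": "Milano"}

        # Map each source's schema onto a common (mediated) schema.
        SCHEMA_MAP = {
            "a": {"sensor_id": "id", "temp_c": "temperature_c", "city": "location"},
            "b": {"id": "id", "temperature": "temperature_c", "location": "location"},
        }

        def normalize(record, source):
            out = {}
            for field, value in record.items():
                target = SCHEMA_MAP[source].get(field)
                if target is None:
                    continue
                # Format reconciliation: convert Fahrenheit to Celsius when needed.
                if target == "temperature_c" and record.get("unit") == "F":
                    value = round((value - 32) * 5 / 9, 1)
                out[target] = value
            return out

        a, b = normalize(source_a, "a"), normalize(source_b, "b")
        # Conflicting values remain (e.g. "Milan" vs "Milano"): resolving them
        # is exactly the data-fusion problem discussed in the entries below.
        print(a, b)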

    Data fusion with source authority and multiple truth

    The abundance of data available on the Web makes it increasingly likely that different sources contain (partially or completely) different values for the same item. Data Fusion is the problem of discovering the true values of a data item when two entities representing it have been found and their values differ. Recent studies have shown that relying only on majority voting to find the true value of an object may yield wrong results for up to 30% of the data items, since false values spread very easily because data sources frequently copy from one another. Therefore, the problem must be solved by assessing the quality of the sources and giving more importance to the values coming from trusted sources. State-of-the-art Data Fusion systems define source trustworthiness on the basis of the accuracy of the provided values and of the dependence on other sources. In this paper we propose an improved algorithm for Data Fusion that extends existing methods based on accuracy and correlation between sources by also taking into account source authority, defined on the basis of the knowledge of which sources copy from which ones. Our method has been designed to work well also in the multi-truth case, that is, when a data item can have multiple true values. Preliminary experimental results on a multi-truth real-world dataset show that our algorithm outperforms previous state-of-the-art approaches.
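    To make the contrast between majority voting and trust-aware fusion concrete, here is a minimal Python sketch with made-up claims and trust scores; it is not the algorithm proposed in the paper, which additionally models source accuracy, copying between sources, authority, and multiple truths.

        from collections import defaultdict

        # Hypothetical claims: five sources state who wrote the same book. Trust
        # scores are assumed here; the paper derives source trustworthiness from
        # value accuracy, copying between sources, and source authority.
        claims = {"s1": "Alice", "s2": "Bob", "s3": "Bob", "s4": "Bob", "s5": "Alice"}
        trust  = {"s1": 0.9,     "s2": 0.2,   "s3": 0.2,   "s4": 0.2,   "s5": 0.8}

        def majority_vote(claims):
            counts = defaultdict(int)
            for value in claims.values():
                counts[value] += 1
            return max(counts, key=counts.get)          # "Bob" wins 3 votes to 2

        def trust_weighted_vote(claims, trust):
            scores = defaultdict(float)
            for source, value in claims.items():
                scores[value] += trust[source]
            return max(scores, key=scores.get)          # "Alice": 1.7 vs 0.6

        # If s2-s4 copied a false value from one another, majority voting is
        # fooled, while the trust-weighted vote recovers the correct value.
        print(majority_vote(claims), trust_weighted_vote(claims, trust))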

    Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity

    Data fusion, within the data integration pipeline, addresses the problem of discovering the true values of a data item when multiple sources provide different values for it. An important contribution to the solution of the problem can be given by assessing the quality of the involved sources and relying more on the values coming from trusted sources. State-of-the-art data fusion systems define source trustworthiness on the basis of the accuracy of the provided values and of the dependence on other sources, and it has also recently been recognized that the trustworthiness of the same source may vary with the domain of interest. In this paper we propose STORM, a novel domain-aware algorithm for data fusion designed for the multi-truth case, that is, when a data item can have multiple true values. Like many other data-fusion techniques, STORM relies on Bayesian inference. However, unlike the other Bayesian approaches to the problem, it determines the trustworthiness of sources by taking into account their authority: here, we define authoritative sources as those that have been copied by many others, assuming that, when source administrators decide to copy data from other sources, they choose the ones they perceive as the most reliable. To group together the values that have been recognized as variants representing the same real-world entity, STORM also provides a value-reconciliation step, thus reducing the possibility of making mistakes in the remaining part of the algorithm. The experimental results on multi-truth synthetic and real-world datasets show that STORM represents a solid step forward in data-fusion research.
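    The following sketch illustrates only the copy-based authority idea on a hypothetical copy graph: sources copied by many others receive higher authority scores. It is not STORM itself, which infers copy relationships from the data and combines authority with Bayesian inference and value reconciliation.

        from collections import defaultdict

        # Hypothetical copy relationships: copier -> sources it copies from.
        # In STORM these dependencies are inferred; here they are simply given.
        copies_from = {
            "s2": ["s1"],
            "s3": ["s1", "s4"],
            "s5": ["s1"],
            "s6": ["s4"],
        }

        def copy_based_authority(copies_from, all_sources):
            # Authority grows with the number of distinct sources that copy
            # from you, normalized so that scores lie in [0, 1].
            copied_by = defaultdict(set)
            for copier, originals in copies_from.items():
                for original in originals:
                    copied_by[original].add(copier)
            n = max(len(all_sources) - 1, 1)
            return {s: len(copied_by[s]) / n for s in all_sources}

        sources = ["s1", "s2", "s3", "s4", "s5", "s6"]
        print(copy_based_authority(copies_from, sources))
        # s1 is copied by three sources -> highest authority; pure copiers get 0.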

    Workflow Characterization of a Big Data System Model for Healthcare Through Multiformalism

    The development of technologies such as cloud computing, IoT, and social networks has caused the amount of data generated daily to grow at an incredible rate, giving birth to the trend of Big Data. Big Data has emerged in the healthcare field thanks to the introduction of new tools producing massive amounts of structured and unstructured data. For this reason, medical institutions are moving towards data-based healthcare, with the goal of leveraging this data to support clinical decision-making through suitable information systems. This comes with the need to evaluate their performance. One of the techniques commonly used is modeling, which consists of evaluating a model of the system under analysis without actually implementing it. However, an adequate performance assessment of Big Data systems requires data with a diversity of volumes and speeds that, due to the sensitivity of healthcare data, is not available. While in other fields this problem is usually solved through the use of synthetic data generators, in healthcare these are few and not specialized in performance evaluation. Therefore, this work focuses on the creation of a synthetic data generator for evaluating the performance of a Big Data system model for healthcare. The dataset used as a reference for creating the generator is MIMIC-III, which contains the digital health records of thousands of patients collected over a time span of multiple years. First, we perform an analysis of the dataset, adopting multiple distribution-fitting techniques (e.g., phase-type fitting) to model the temporal distribution of the data. Then, we develop a generator structured as a multi-module library to allow the customization of each component; specifically, we propose a multiformalism model to reproduce patient behavior inside the hospital. Finally, we test the generator by evaluating its performance in different scenarios. Through these experiments, we show the granular control that the generator offers over the synthetic data produced and the simplicity with which it can be adapted to different uses.
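    As a rough sketch of the distribution-fitting idea, with made-up inter-admission times rather than MIMIC-III data and a plain exponential fit in place of the richer phase-type models used in the work, synthetic admission timestamps could be produced as follows.

        import numpy as np

        rng = np.random.default_rng(42)

        # Made-up inter-admission times in hours (a stand-in for the values the
        # paper extracts from MIMIC-III); not real patient data.
        observed_gaps = np.array([1.2, 0.4, 3.1, 0.9, 2.2, 0.7, 1.5, 4.0, 0.3, 1.1])

        # Maximum-likelihood fit of an exponential distribution (the paper uses
        # richer models such as phase-type distributions; this shows the idea).
        rate = 1.0 / observed_gaps.mean()

        def synthetic_admissions(n_events, rate, rng):
            """Generate n_events synthetic admission timestamps (hours from t=0)."""
            gaps = rng.exponential(scale=1.0 / rate, size=n_events)
            return np.cumsum(gaps)

        print(synthetic_admissions(5, rate, rng))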

    Extraction of medical concepts from Italian natural language descriptions

    In this paper we present a Natural Language Processing (NLP) pipeline to automatically extract medical concepts from free text written in a language other than English. To do so, we use common NLP techniques and the Metathesaurus of the Unified Medical Language System (UMLS). Specifically, our goal is to automatically extract ontological concepts representing which part of the human body is injured and what the nature of the injury is, given an Italian textual description of a work accident. We start by partitioning the text into tokens and assigning each token its part of speech, and then use an appropriate tool to extract relevant concepts to be searched within UMLS. We tested our system on a large public repository containing textual descriptions of work accidents produced by INAIL. Experimental results confirm that our system is able to correctly extract relevant medical concepts from texts written in Italian.
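    A minimal sketch of the pipeline's shape, assuming spaCy's Italian model for tokenization and part-of-speech tagging and a tiny in-memory dictionary in place of a real UMLS Metathesaurus lookup (the paper does not specify these exact tools):

        import spacy  # assumes: python -m spacy download it_core_news_sm

        # Tiny stand-in for a UMLS Metathesaurus lookup; the identifiers below
        # are placeholders, not real UMLS CUIs.
        LEXICON = {
            "frattura": "CUI_FRACTURE_PLACEHOLDER",
            "mano": "CUI_HAND_PLACEHOLDER",
        }

        nlp = spacy.load("it_core_news_sm")

        def extract_candidates(text):
            """Tokenize and POS-tag the text, keeping noun lemmas as candidate terms."""
            doc = nlp(text)
            return [tok.lemma_.lower() for tok in doc if tok.pos_ in {"NOUN", "PROPN"}]

        def map_to_concepts(text, lexicon=LEXICON):
            """Map candidate terms to the concept identifiers found in the lexicon."""
            return {t: lexicon[t] for t in extract_candidates(text) if t in lexicon}

        # "The worker suffered a fracture of the left hand."
        print(map_to_concepts("Il lavoratore ha riportato una frattura alla mano sinistra."))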

    RECKOn: A real-world, context-aware knowledge-based lab

    The RECKON project focuses on interconnection technologies and context-aware data-analytics techniques to improve safety in workplaces, with the ultimate objective of identifying and preventing dangerous situations before accidents occur. In RECKON, prevention is interpreted through the latest monitoring, diagnostics, and prognostics techniques from a safety perspective, making it possible to detect and use, even in real time, a large amount of data about the entire operational context. Using sensor networks, we are able to collect information that is used in two ways: (i) when a potentially dangerous situation is detected, the system raises an alarm to prevent an accident, and (ii) whenever an accident or a near-miss (i.e., a potential accident that was narrowly averted) occurs, the related useful information is stored in an automatically generated case report and later used to update the accident-prevention policies. This work briefly describes the operational framework of RECKON, along with its modules and their interaction.
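    The two uses of sensor information could look roughly like the sketch below; the sensor names, thresholds, and report format are hypothetical, not RECKON's actual rules or data model.

        import datetime
        import json

        # Hypothetical safety thresholds for illustration only.
        THRESHOLDS = {"gas_ppm": 50.0, "temperature_c": 60.0}

        def check_reading(sensor, value):
            """Return an alarm message if the reading exceeds its safety threshold."""
            limit = THRESHOLDS.get(sensor)
            if limit is not None and value > limit:
                return f"ALARM: {sensor}={value} exceeds safety limit {limit}"
            return None

        def build_case_report(event_type, readings):
            """Automatically generated case report for an accident or near-miss."""
            return json.dumps({
                "timestamp": datetime.datetime.now().isoformat(),
                "event_type": event_type,   # "accident" or "near-miss"
                "readings": readings,       # sensor context at the time of the event
            }, indent=2)

        print(check_reading("gas_ppm", 72.0))
        print(build_case_report("near-miss", {"gas_ppm": 72.0, "temperature_c": 31.0}))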