56,295 research outputs found

    Data mining and fusion

    No full text

    On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research

    Full text link
    Scientific research requires access, analysis, and sharing of data that is distributed across various heterogeneous data sources at the scale of the Internet. An eager ETL process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. The bootstrapping of this process is not efficient for scientific research that requires access to data from very large and typically numerous distributed data sources. a lazy ETL process loads only the metadata, but still eagerly. Lazy ETL is faster in bootstrapping. However, queries on the integrated data repository of eager ETL perform faster, due to the availability of the entire data beforehand. In this paper, we propose a novel ETL approach for scientific data integration, as a hybrid of eager and lazy ETL approaches, and applied both to data as well as metadata. This way, Hybrid ETL supports incremental integration and loading of metadata and data from the data sources. We incorporate a human-in-the-loop approach, to enhance the hybrid ETL, with selective data integration driven by the user queries and sharing of integrated data between users. We implement our hybrid ETL approach in a prototype platform, Obidos, and evaluate it in the context of data sharing for medical research. Obidos outperforms both the eager ETL and lazy ETL approaches, for scientific research data integration and sharing, through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository.Comment: Pre-print Submitted to the DMAH Special Issue of the Springer DAPD Journa

    A space-time multivariate Bayesian model to analyse road traffic accidents by severity

    Get PDF
    The paper investigates the dependences between levels of severity of road traffic accidents, accounting at the same time for spatial and temporal correlations. The study analyses road traffic accidents data at ward level in England over the period 2005–2013. We include in our model multivariate spatially structured and unstructured effects to capture the dependences between severities, within a Bayesian hierarchical formulation. We also include a temporal component to capture the time effects and we carry out an extensive model comparison. The results show important associations in both spatially structured and unstructured effects between severities, and a downward temporal trend is observed for low and high levels of severity. Maps of posterior accident rates indicate elevated risk within big cities for accidents of low severity and in suburban areas in the north and on the southern coast of England for accidents of high severity. The posterior probability of extreme rates is used to suggest the presence of hot spots in a public health perspective.Areti Boulieri acknowledges support from the National Institute for Health Research and the Medical Research Council Doctoral Training Partnership. Marta Blangiardo acknowledges support from the National Institute for Health Research and the Medical Research Council–Public Health England Centre for Environment and Health. Silvia Liverani acknowledges support from the Leverhulme Trust (grant ECF-2011-576)

    Building Data-Driven Pathways From Routinely Collected Hospital Data:A Case Study on Prostate Cancer

    Get PDF
    Background: Routinely collected data in hospitals is complex, typically heterogeneous, and scattered across multiple Hospital Information Systems (HIS). This big data, created as a byproduct of health care activities, has the potential to provide a better understanding of diseases, unearth hidden patterns, and improve services and cost. The extent and uses of such data rely on its quality, which is not consistently checked, nor fully understood. Nevertheless, using routine data for the construction of data-driven clinical pathways, describing processes and trends, is a key topic receiving increasing attention in the literature. Traditional algorithms do not cope well with unstructured processes or data, and do not produce clinically meaningful visualizations. Supporting systems that provide additional information, context, and quality assurance inspection are needed. Objective: The objective of the study is to explore how routine hospital data can be used to develop data-driven pathways that describe the journeys that patients take through care, and their potential uses in biomedical research; it proposes a framework for the construction, quality assessment, and visualization of patient pathways for clinical studies and decision support using a case study on prostate cancer. Methods: Data pertaining to prostate cancer patients were extracted from a large UK hospital from eight different HIS, validated, and complemented with information from the local cancer registry. Data-driven pathways were built for each of the 1904 patients and an expert knowledge base, containing rules on the prostate cancer biomarker, was used to assess the completeness and utility of the pathways for a specific clinical study. Software components were built to provide meaningful visualizations for the constructed pathways. Results: The proposed framework and pathway formalism enable the summarization, visualization, and querying of complex patient-centric clinical information, as well as the computation of quality indicators and dimensions. A novel graphical representation of the pathways allows the synthesis of such information. Conclusions: Clinical pathways built from routinely collected hospital data can unearth information about patients and diseases that may otherwise be unavailable or overlooked in hospitals. Data-driven clinical pathways allow for heterogeneous data (ie, semistructured and unstructured data) to be collated over a unified data model and for data quality dimensions to be assessed. This work has enabled further research on prostate cancer and its biomarkers, and on the development and application of methods to mine, compare, analyze, and visualize pathways constructed from routine data. This is an important development for the reuse of big data in hospitals
    • …
    corecore