71 research outputs found

    Enabling automatic provenance-based trust assessment of web content

    DARIAH and the Benelux

    Reconstructing Data Provenance from Log Files

    Data provenance describes the derivation history of data, capturing details such as the entities involved and the relationships between them. Knowledge of data provenance can be used to address issues such as data quality assurance, data auditing, and system security. However, current computer systems are usually not equipped with means to acquire data provenance, and modifying underlying systems or introducing new monitoring software for provenance logging may be too invasive for production systems. As a result, data provenance may not always be available. This thesis investigates the completeness and correctness of data provenance reconstructed from log files with respect to the actual derivation history. To accomplish this, we designed and tested a solution that first extracts and models information from log files as provenance relations and then reconstructs the data provenance from those relations. The reconstructed output is then evaluated against the ground-truth provenance. The thesis also details the methodology used for constructing a dataset for provenance reconstruction research. Experimental results revealed that data provenance which completely captures the ground truth can be reconstructed from system-layer log files. However, the outputs are susceptible to errors generated during event logging and errors induced by program dependencies. Results also show that using log files of different granularities collected from the system can help resolve the logging errors described. Experiments with removing suspected program dependencies using approaches such as blacklisting and clustering have shown that the number of errors can be reduced by a factor of one hundred. Conclusions drawn from this research contribute towards the use of reconstruction as an alternative approach for acquiring data provenance from computer systems.
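    To make the reconstruction idea concrete, a minimal sketch of the two steps described above (extracting provenance relations from log lines and assembling them into a provenance graph) might look as follows. The log-line format, the read/write operations, and the use of networkx are illustrative assumptions, not the actual solution evaluated in the thesis.

```python
import re
import networkx as nx

# Hypothetical log format (an assumption for this sketch):
# "<timestamp> <process> <operation> <path>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<proc>\S+)\s+(?P<op>read|write)\s+(?P<path>\S+)$"
)

def extract_relations(log_lines):
    """Extract (subject, relation, object) provenance relations from raw log lines."""
    for line in log_lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not match the assumed format
        proc, op, path = m.group("proc"), m.group("op"), m.group("path")
        if op == "read":
            yield (proc, "used", path)            # activity used entity
        else:
            yield (path, "wasGeneratedBy", proc)  # entity was generated by activity

def reconstruct_provenance(log_lines):
    """Assemble the extracted relations into a directed provenance graph."""
    graph = nx.DiGraph()
    for src, relation, dst in extract_relations(log_lines):
        graph.add_edge(src, dst, relation=relation)
    return graph

logs = [
    "10:00:01 convert.py read raw/input.csv",
    "10:00:02 convert.py write out/clean.csv",
]
print(list(reconstruct_provenance(logs).edges(data=True)))
```

    A reconstructed graph of this kind could then be compared edge by edge against ground-truth provenance to assess the completeness and correctness discussed above.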

    Integrative bioinformatics and graph-based methods for predicting adverse effects of developmental drugs

    Adverse drug effects are complex phenomena that involve the interplay between drug molecules and their protein targets at various levels of biological organisation, from molecular to organismal. Many factors are known to contribute toward the safety profile of a drug, including the chemical properties of the drug molecule itself, the biological properties of drug targets and other proteins involved in the pharmacodynamic and pharmacokinetic aspects of drug action, and the characteristics of the intended patient population. A multitude of scattered, publicly available resources cover these important aspects of drug activity, including manually curated biological databases, high-throughput experimental results from gene expression and human genetics resources, as well as drug labels and registered clinical trial records. This thesis proposes an integrated analysis of these disparate sources of information to help bridge the gap between the molecular and the clinical aspects of drug action. For example, to address the commonly held assumption that narrowly expressed proteins make safer drug targets, an integrative data-driven analysis was conducted to systematically investigate the relationship between the tissue expression profile of drug targets and the organs affected by clinically observed adverse drug reactions. Similarly, human genetics data were used extensively throughout the thesis to compare adverse symptoms induced by drug molecules with the phenotypes associated with the genes encoding their target proteins. One of the main outcomes of this thesis was the generation of a large knowledge graph, which incorporates diverse molecular and phenotypic data in a structured network format. To leverage the integrated information, two graph-based machine learning methods were developed to predict a wide range of adverse drug effects caused by approved and developmental therapies.
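    As a rough illustration of the knowledge-graph idea described above, the sketch below loads a handful of heterogeneous drug-target-tissue-phenotype triples into a typed multigraph and runs a simple query linking a drug's targets to the organs that express them. The example relations, node names, and the networkx representation are assumptions for illustration and do not reflect the thesis's actual schema or data.

```python
import networkx as nx

# Illustrative triples of the kind integrated from curated databases,
# expression atlases and drug labels; all names here are assumptions.
triples = [
    ("atorvastatin", "targets",         "HMGCR"),
    ("HMGCR",        "expressed_in",    "liver"),
    ("atorvastatin", "causes_adr",      "myopathy"),
    ("HMGCR",        "associated_with", "muscle phenotype"),
]

kg = nx.MultiDiGraph()
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)

# Example query: which organs express the targets of a given drug?
drug = "atorvastatin"
targets = [t for _, t, d in kg.out_edges(drug, data=True)
           if d["relation"] == "targets"]
organs = [o for t in targets
          for _, o, d in kg.out_edges(t, data=True)
          if d["relation"] == "expressed_in"]
print(organs)  # ['liver']
```

    Graph-based machine learning methods of the kind mentioned above would operate over a far larger graph of this shape, learning predictive patterns from its structure rather than from hand-written queries.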

    Discovering lesser known molecular players and mechanistic patterns in Alzheimer's disease using an integrative disease modelling approach

    The convergence of exponentially advancing technologies is driving medical research towards life-changing discoveries. In contrast, the repeated failure of high-profile drugs to combat Alzheimer's disease (AD) has made it one of the least successful therapeutic areas. This failure pattern has provoked researchers to re-examine their beliefs about Alzheimer's aetiology. The growing realisation that amyloid-β and tau are not 'the' factors but rather 'among the' factors necessitates the reassessment of pre-existing data to add new perspectives. To enable a holistic view of the disease, integrative modelling approaches are emerging as a powerful technique. Combining data of different scales and modalities could considerably increase the predictive power of an integrative model by filling biological knowledge gaps. However, the reliability of the derived hypotheses largely depends on the completeness, quality, consistency, and context-specificity of the data. Thus, there is a need for agile methods and approaches that efficiently interrogate and utilise existing public data. This thesis presents the development of novel approaches and methods that address intrinsic issues of data integration and analysis in AD research. It aims to prioritise lesser-known AD candidates using highly curated and precise knowledge derived from integrated data. Here, much of the emphasis is put on quality, reliability, and context-specificity. This thesis showcases the benefit of integrating well-curated, disease-specific heterogeneous data in a semantic web-based framework for mining actionable knowledge. Furthermore, it introduces the challenges encountered while harvesting information from literature and transcriptomic resources. A state-of-the-art text-mining methodology is developed to extract miRNAs and their regulatory roles in diseases and genes from the biomedical literature. To enable meta-analysis of biologically related transcriptomic data, a highly curated metadata database has been developed, which explicates annotations specific to human and animal models. Finally, to corroborate common mechanistic patterns, embedded with novel candidates, across large-scale AD transcriptomic data, a new approach to generating gene regulatory networks has been developed. The work presented here has demonstrated its capability in identifying testable mechanistic hypotheses containing previously unknown or emerging knowledge from public data in two major publicly funded projects on Alzheimer's disease, Parkinson's disease, and epilepsy.
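    The literature-mining component can be illustrated, in heavily simplified form, by a pattern-based extractor of miRNA-gene regulation statements. The regular expressions and the example sentence below are toy assumptions that stand in for the state-of-the-art text-mining methodology actually developed in the thesis.

```python
import re

# Toy patterns: a miRNA identifier followed by a regulatory verb and a gene symbol.
MIRNA = r"(?P<mirna>miR-\d+[a-z]?(?:-[35]p)?)"
PATTERN = re.compile(
    MIRNA + r"\s+(?P<verb>represses|inhibits|downregulates|targets)\s+(?P<gene>[A-Z0-9]{2,})"
)

def extract_mirna_relations(sentence):
    """Return (miRNA, relation, gene) triples found in a sentence."""
    return [(m.group("mirna"), m.group("verb"), m.group("gene"))
            for m in PATTERN.finditer(sentence)]

print(extract_mirna_relations(
    "In AD brain tissue, miR-132 downregulates FOXO3 expression."
))
# [('miR-132', 'downregulates', 'FOXO3')]
```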

    Investigative Methods: An NCRM Innovation Collection

    Computational Stylistics in Poetry, Prose, and Drama

    The contributions in this edited volume approach poetry, narrative, and drama from the perspective of Computational Stylistics. They exemplify methods of computational textual analysis and explore the possibility of computational generation of literary texts. The volume presents a range of computational and Natural Language Processing applications to literary studies, such as motif detection, network analysis, machine learning, and deep learning.

    Forgotten as data – remembered through information. Social memory institutions in the digital age: the case of the Europeana Initiative

    The study of social memory has emerged as a rich field of research closely linked to cultural artefacts, communication media and institutions as carriers of a past that transcends the horizon of the individual’s lifetime. Within this domain of research, the dissertation focuses on memory institutions (libraries, archives, museums) and the shifts they are undergoing as the outcome of digitization and the diffusion of online media. Very little is currently known about the impact that digitality and computation may have on social memory institutions, specifically, and social memory, more generally – an area of study that would benefit from but, so far, has been mostly overlooked by information systems research. The dissertation finds its point of departure in the conceptualization of information as an event that occurs through the interaction between an observer and the observed – an event that cannot be stored as information but merely as data. In this context, memory is conceived as an operation that filters, thus forgets, the singular details of an information event by making it comparable to other events according to abstract classification criteria. Against this backdrop, memory institutions are institutions of forgetting as they select, order and preserve a canon of cultural heritage artefacts. Supported by evidence from a case study on the Europeana initiative (a digitization project of European libraries, archives and museums), the dissertation reveals a fundamental shift in the field of memory institutions. The case study demonstrates the disintegration of 1) the cultural heritage artefact, 2) its standard modes of description, and 3) the catalogue as such into a steadily accruing assemblage of data and metadata. Dismembered into bits and bytes, cultural heritage needs to be re-membered through the emulation of recognizable cultural heritage artefacts and momentary renditions of order. In other words, memory institutions forget as binary-based data and remember through computational information.

    Towards The Efficient Use Of Fine-Grained Provenance In Data Science Applications

    Recent years have witnessed increased demand for users to be able to interpret the results of data science pipelines, locate erroneous data items in the input, evaluate the importance of individual input data items, and acknowledge the contributions of data curators. Such applications often involve the use of provenance at a fine-grained level and require very fast response times. To address this need, my goal is to expedite the use of fine-grained provenance in applications within both the database and machine learning domains, which are ubiquitous in contemporary data science pipelines. In applications from the database domain, I focus on the problem of data citation and provide two different types of solutions, rewriting-based and provenance-based, to generate fine-grained citations to database query results by implicitly or explicitly leveraging provenance information. In applications from the ML domain, the first problem considered is incrementally updating ML models after the deletion of a small subset of training samples. This is critical for understanding the importance of individual training samples to ML models, especially in online pipelines. For this problem, I provide two solutions, PrIU and DeltaGrad, which incrementally update ML models constructed by SGD/GD methods by utilizing provenance information collected during the training phase on the full dataset, before the deletion requests. The second ML application I focus on is how to clean label uncertainties in the ML training dataset more efficiently and cheaply. To address this problem, I propose a solution, CHEF, which reduces the cost and overhead at each phase of the label-cleaning pipeline while simultaneously maintaining overall model performance. I also propose initial ideas for removing some of the assumptions used in these solutions in order to extend them to more general scenarios.
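    PrIU and DeltaGrad adjust SGD/GD training using information cached during the original run. As a much simpler illustration of the underlying idea (retaining enough training-time "provenance" that a deletion does not force full retraining), the sketch below downdates the sufficient statistics of an ordinary least-squares fit when one training sample is removed. The closed-form setting, the synthetic data, and the helper delete_sample are assumptions for illustration only, not the algorithms from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

# "Provenance" retained from training: the sufficient statistics of the fit.
XtX = X.T @ X
Xty = X.T @ y
theta = np.linalg.solve(XtX, Xty)

def delete_sample(XtX, Xty, x_i, y_i):
    """Downdate the cached statistics when one training sample is removed."""
    return XtX - np.outer(x_i, x_i), Xty - y_i * x_i

# Remove sample 7 and refit from the downdated statistics (no full retraining).
XtX2, Xty2 = delete_sample(XtX, Xty, X[7], y[7])
theta_incremental = np.linalg.solve(XtX2, Xty2)

# Sanity check against retraining from scratch on the reduced dataset.
mask = np.ones(len(y), dtype=bool)
mask[7] = False
theta_full = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
assert np.allclose(theta_incremental, theta_full)
```

    The incremental result matches retraining from scratch on the reduced dataset, which is the kind of equivalence the methods above aim to approximate efficiently for iteratively trained models.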
