
    Enabling Interactive Analytics of Secure Data using Cloud Kotta

    Research, especially in the social sciences and humanities, is increasingly reliant on the application of data science methods to analyze large amounts of (often private) data. Secure data enclaves provide a solution for managing and analyzing private data. However, such enclaves do not readily support discovery science---a form of exploratory or interactive analysis by which researchers execute a range of (sometimes large) analyses in an iterative and collaborative manner. The batch computing model offered by many data enclaves is well suited to executing large compute tasks; however, it is far from ideal for day-to-day discovery science. As researchers must submit jobs to queues and wait for results, the high latencies inherent in queue-based, batch computing systems hinder interactive analysis. In this paper we describe how we have augmented the Cloud Kotta secure data enclave to support collaborative and interactive analysis of sensitive data. Our model uses Jupyter notebooks as a flexible analysis environment and Python language constructs to support the execution of arbitrary functions on private data within this secure framework. (To appear in Proceedings of the Workshop on Scientific Cloud Computing (ScienceCloud 2017), Washington, DC, USA, June 2017; 7 pages.)
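
    The abstract does not spell out the implementation, so what follows is a minimal, hypothetical sketch of the kind of Python construct described: a decorator that marks a function for execution inside the enclave and returns a handle to the eventual result. The decorator name, the thread-pool stand-in, and the data path are assumptions for illustration, not part of the actual Cloud Kotta API.

    import concurrent.futures
    import functools

    # Stand-in for the enclave's compute service; a real deployment would route
    # work to trusted workers that hold access to the protected storage tier.
    _executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def enclave_task(func):
        """Mark a function so calls are dispatched for execution near the data."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A real implementation would serialize func and its arguments and
            # submit them to the secure environment; here we only simulate that
            # with a local thread pool and return a future as the result handle.
            return _executor.submit(func, *args, **kwargs)
        return wrapper

    @enclave_task
    def count_rows(dataset_path):
        # Analysis code that would run where the private data lives.
        with open(dataset_path) as f:
            return sum(1 for _ in f)

    # future = count_rows("/secure/data/records.csv")   # hypothetical path
    # print(future.result())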

    Can the NHS be a learning healthcare system in the age of digital technology?

    ‘Big data’ is defined by ‘7 V’s’: volume (the most frequently cited [1]), velocity, veracity, variety, volatility, validity and value. In healthcare, ‘big data’ is associated with a step-change in the way information is gathered, analysed and used to facilitate disease management and prevention. With greater electronic data capture, there is enthusiasm for increased safety, efficiency and effectiveness in health and social care through, for example, machine learning and other forms of artificial intelligence (AI). However, factors maintaining and widening the gap between the promise and the reality need to be addressed.

    Reproducible big data science: A case study in continuous FAIRness.

    Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
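
    The specific tooling is not named in this summary, so the sketch below is only a generic illustration of one idea mentioned here, assigning identifiers to data throughout the lifecycle: it derives a content-based identifier (a checksum) for a file and records it in a small JSON manifest. The file name and manifest layout are assumptions, not the authors' system.

    import hashlib
    import json
    import pathlib

    def checksum(path, algo="sha256", chunk_size=1 << 20):
        """Return a content-derived identifier such as 'sha256:<hex digest>'."""
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return f"{algo}:{h.hexdigest()}"

    def add_to_manifest(path, manifest="manifest.json"):
        """Record the file, its identifier, and its size in a JSON manifest."""
        entry = {
            "file": str(path),
            "id": checksum(path),
            "size": pathlib.Path(path).stat().st_size,
        }
        m = pathlib.Path(manifest)
        records = json.loads(m.read_text()) if m.exists() else []
        records.append(entry)
        m.write_text(json.dumps(records, indent=2))
        return entry

    # add_to_manifest("dnase_peaks.bed")   # hypothetical input file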

    New approaches for unsupervised transcriptomic data analysis based on Dictionary learning

    The era of high-throughput data generation enables new access to biomolecular profiles and their exploitation. However, the analysis of such biomolecular data, for example transcriptomic data, suffers from the so-called "curse of dimensionality". This occurs in the analysis of datasets with a significantly larger number of variables than data points. As a consequence, overfitting and unintentional learning of process-independent patterns can occur, which can leave results with little practical significance. A common way of counteracting this problem is to apply dimension reduction methods and then analyse the resulting low-dimensional representation, which has a smaller number of variables. In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concepts of Dictionary learning, an unsupervised dimension reduction approach. Unlike many dimension reduction approaches widely applied in transcriptomic data analysis, Dictionary learning does not impose constraints on the components that are to be derived. This allows for great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of sparse methods is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Transcriptomic data are particularly structured, for example because of the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis has so far been largely restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach; interpretability is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes. Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups for samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods widely applied in transcriptomic data analysis. Our methods achieve high performance and overall outperform the comparison methods.
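
    The thesis's own methods are not reproduced here; as a small, generic illustration of sparse Dictionary learning used as a dimension reduction step on an expression-like matrix, the sketch below applies scikit-learn's DictionaryLearning to toy data. The matrix shape and hyperparameters are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 500))        # 60 samples x 500 genes (toy data)

    dl = DictionaryLearning(
        n_components=8,                   # number of dictionary atoms (latent components)
        alpha=1.0,                        # sparsity penalty on the coefficients
        transform_algorithm="lasso_lars",
        max_iter=100,
        random_state=0,
    )
    codes = dl.fit_transform(X)           # sparse representation, shape (60, 8)
    atoms = dl.components_                # dictionary atoms in gene space, shape (8, 500)

    print(codes.shape, atoms.shape)
    print("fraction of zero coefficients:", np.mean(codes == 0))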

    Precision health approaches: ethical considerations for health data processing

    This thesis provides insights and recommendations on some of the most crucial elements necessary for an effective, legally and ethically sound implementation of precision health approaches in the Swiss context (and beyond), specifically precision medicine and precision public health. In this regard, the thesis recognizes the centrality of data in these two domains, and the ethical and scientific imperative of ensuring the widespread and responsible sharing of high quality health data between the numerous stakeholders involved in healthcare, public health and associated research domains. It also recognizes the need to protect not only the interests of data subjects but also those of data processors: it is only through a comprehensive assessment of the needs and expectations of all stakeholders regarding data sharing activities that sustainable solutions to known ethical and scientific conundrums can be devised and implemented. In addition, the chapters included in this thesis emphasize solutions that can be convincingly applied to real world problems, with the ultimate objective of having a concrete impact on clinical and public health practice and policies, including research activities. The strengths of this thesis reside in a careful and in-depth interdisciplinary assessment of the different issues at stake in precision health approaches, with the elaboration of the least disruptive solutions possible and recommendations that can readily be evaluated and adopted by the relevant stakeholders in these two domains. The thesis has three main objectives: (i) to investigate and identify factors influencing the processing of health data in the Swiss context and to suggest potential solutions and recommendations, since a better understanding of these factors is paramount for an effective implementation of precision health approaches given their strong dependence on high quality and easily accessible health datasets; (ii) to identify and explore the ethical, legal and social issues (ELSI) of innovative participatory disease surveillance systems, which also fall under precision health approaches, to examine how research ethics are coping with this field, and to strengthen the ethical approaches currently used to address these ELSI by providing a robust ethical framework; and (iii) to investigate how precision health approaches might fail to achieve their social justice and health equity goals if the impact of structural racism on these initiatives is not given due consideration, and to provide recommendations and potential actions that could help such approaches adhere to those goals. These three objectives are investigated using both empirical and theoretical research methods. The empirical branch consists of systematic and scoping reviews, both adhering to the PRISMA guidelines, and two interview-based studies carried out with Swiss expert stakeholders. The theoretical branch consists of three chapters, each addressing important aspects of precision health approaches.

    The evaluation and harmonisation of disparate information metamodels in support of epidemiological and public health research

    BACKGROUND: Descriptions of data, metadata, provide researchers with the contextual information they need to achieve research goals. Metadata enable data discovery, sharing and reuse, and are fundamental to managing data across the research data lifecycle. However, challenges associated with data discoverability negatively impact the extent to which these data are known by the wider research community. This, combined with a lack of quality assessment frameworks and limited awareness of the implications of poor quality metadata, is hampering the way in which epidemiological and public health research data are documented and repurposed. Furthermore, the absence of enduring metadata management models to capture consent for record linkage metadata in longitudinal studies can hinder researchers from establishing standardised descriptions of consent. AIM: To examine how metadata management models can be applied to improve the use of research data within the context of epidemiological and public health research. METHODS: A combination of systematic literature reviews, online surveys and qualitative data analyses was used to investigate the current state of the art, identify currently perceived challenges and inform the creation and evaluation of the models. RESULTS: There are three components to this thesis: a) enhancing data discoverability; b) improving metadata quality assessment; and c) improving the capture of consent for record linkage metadata. First, three models for enhancing research data discoverability were examined: data publications, linked data on the World Wide Web and the development of an online public health portal. Second, a novel framework for assessing the quality of epidemiological and public health metadata was created and evaluated. Third, a novel metadata management model to improve the capture of consent for record linkage metadata was created and evaluated. CONCLUSIONS: Findings from these studies have contributed to a set of recommendations for change in research data management policy and practice to enhance stakeholders' research environment.
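
    The quality framework itself is not described in this summary; as a hedged illustration of one dimension such an assessment might score, completeness against an expected field list, the sketch below checks a metadata record against a small, assumed set of fields. The field names and the example record are hypothetical.

    # Illustrative only: not the thesis's actual quality assessment framework.
    EXPECTED_FIELDS = ("title", "description", "creator", "date", "license", "access_conditions")

    def completeness(record):
        """Return the fraction of expected fields present and the list of missing ones."""
        present = [f for f in EXPECTED_FIELDS if record.get(f)]
        missing = sorted(set(EXPECTED_FIELDS) - set(present))
        return len(present) / len(EXPECTED_FIELDS), missing

    record = {"title": "Example cohort study", "creator": "Example University", "date": "2015"}
    score, missing = completeness(record)
    print(f"completeness={score:.2f}, missing={missing}")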