    Ontology-based data integration between clinical and research systems

    Data from the electronic medical record comprise numerous structured but uncoded elements that are not linked to standard terminologies. Reuse of such data for secondary research purposes has gained importance recently. However, identifying relevant data elements and creating database jobs for extraction, transformation and loading (ETL) are challenging: with current methods such as data warehousing, it is not feasible to efficiently maintain and reuse semantically complex data extraction and transformation routines. We present an ontology-supported approach that overcomes this challenge through abstraction: instead of defining ETL procedures at the database level, we use ontologies to organize and describe the medical concepts of both the source system and the target system. Instead of using unique, specifically developed SQL statements or ETL jobs, we define declarative transformation rules within ontologies and illustrate how these constructs can then be used to automatically generate SQL code that performs the desired ETL procedures. This demonstrates how a suitable level of abstraction may not only aid the interpretation of clinical data, but can also foster the reuse of methods for unlocking it.
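
The idea of generating SQL from declarative transformation rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MappingRule` class, the table and column names (`emr_vitals`, `weight_g`, `research_facts`) and the concept names are all hypothetical, and a plain data class stands in for rules that the abstract says are stored within ontologies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRule:
    """Hypothetical declarative rule linking a source element to a target concept."""
    target_concept: str        # concept name in the (assumed) target ontology
    source_table: str          # source table holding the raw element
    source_column: str         # source column holding the raw value
    transform: str = "{col}"   # SQL expression template; {col} is the source column


def generate_sql(rule: MappingRule, target_table: str = "research_facts") -> str:
    """Turn one declarative rule into an INSERT ... SELECT statement."""
    expr = rule.transform.format(col=rule.source_column)
    return (
        f"INSERT INTO {target_table} (patient_id, concept, value)\n"
        f"SELECT patient_id, '{rule.target_concept}', {expr}\n"
        f"FROM {rule.source_table}\n"
        f"WHERE {rule.source_column} IS NOT NULL;"
    )


# Illustrative rules only; none of these names come from the source.
rules = [
    MappingRule("BodyWeightKg", "emr_vitals", "weight_g", "{col} / 1000.0"),
    MappingRule("SmokingStatus", "emr_anamnesis", "smoker_flag"),
]

if __name__ == "__main__":
    for rule in rules:
        print(generate_sql(rule))
        print()
```

Because the rules are data rather than hand-written statements, changing a unit conversion or retargeting a concept means editing one rule, and the SQL is regenerated from it.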

    Systematizing FAIR research data management in biomedical research projects: a data life cycle approach

    Biomedical researchers face data management challenges brought by a new generation of data driven by the advent of translational medicine research. These challenges are further complicated by recent calls for data reuse and long-term stewardship, spearheaded by the FAIR principles initiative. As a result, there is increasingly widespread recognition that advancing biomedical science depends on applying data science to manage and use highly diverse and complex data in ways that give it context, meaning, and longevity beyond its initial purpose. However, current methods and practices in biomedical informatics still adopt a traditional linear view of the informatics process (collect, store and analyse), focusing primarily on the challenges of data integration and analysis, which pertain to only part of the overall life cycle of research data. The aim of this research is to facilitate the adoption and integration of data management practices into the research life cycle of biomedical projects, thus improving their capability to solve the data management challenges they face throughout the course of their research work. To achieve this aim, this thesis takes a data life cycle approach to define and develop a systematic methodology and framework for FAIR data management in biomedical research projects. The overarching contribution of this research is a data-state life cycle model for research data management in Biomedical Translational Research Projects. This model provides insight into the dynamics between 1) the purpose of a research-driven data use case, 2) the data requirements that render data in a state fit for purpose, 3) the data management functions that prepare and act upon data and 4) the resulting state of data that is fit to serve the use case. 
This insight led to the development of a FAIR data management framework, which is another contribution of this thesis. This framework provides data managers with the groundwork, including the data models, resources and capabilities, needed to build a FAIR data management environment to manage data during the operational stages of a biomedical research project. An exemplary implementation of this architecture (PlatformTM) was developed and validated with real-world research datasets produced by collaborative research programs funded by the Innovative Medicines Initiative (IMI): BioVacSafe, eTRIKS and FAIRplus.

    Cohort Identification Using Semantic Web Technologies: Ontologies and Triplestores as Engines for Complex Computable Phenotyping

    Electronic health record (EHR)-based computable phenotypes are algorithms used to identify individuals or populations with clinical conditions or events of interest within a clinical data repository. Due to a lack of EHR data standardization, computable phenotypes can be semantically ambiguous and difficult to share across institutions. In this research, I propose a new computable phenotyping methodological framework based on semantic web technologies, specifically ontologies, the Resource Description Framework (RDF) data format, triplestores, and Web Ontology Language (OWL) reasoning. My hypothesis is that storing and analyzing clinical data using these technologies can begin to address the critical issues of semantic ambiguity and lack of interoperability in the context of computable phenotyping. To test this hypothesis, I compared the performance of two variants of two computable phenotypes (for depression and rheumatoid arthritis, respectively). The first variant of each phenotype used a list of ICD-10-CM codes to define the condition; the second variant used ontology concepts from SNOMED and the Human Phenotype Ontology (HPO). After executing each variant of each phenotype against a clinical data repository, I compared the patients matched in each case to see where the different variants overlapped and diverged. Both the ontologies and the clinical data were stored in an RDF triplestore to allow me to assess the interoperability advantages of the RDF format for clinical data. All tested methods successfully identified cohorts in the data store, with differing rates of overlap and divergence between variants. Depending on the phenotyping use case, SNOMED and HPO’s ability to more broadly define many conditions due to complex relationships between their concepts may be seen as an advantage or a disadvantage. 
I also found that RDF triplestores do indeed provide interoperability advantages, despite being far less commonly used in clinical data applications than relational databases. Although these methods and technologies are not “one-size-fits-all,” the experimental results are encouraging enough for them to (1) be put into practice in combination with existing phenotyping methods or (2) be used on their own for particularly well-suited use cases.
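
The comparison between a code-list variant and an ontology-concept variant of a phenotype can be sketched in plain Python. This is a toy illustration under loud assumptions, not the thesis's method: the thesis stores data in an RDF triplestore and uses OWL reasoning, whereas here a small dictionary stands in for an is-a hierarchy, and the patients, concept names and code assignments are invented.

```python
# Hypothetical toy is-a hierarchy: child concept -> parent concept.
IS_A = {
    "recurrent_depression": "depressive_disorder",
    "postpartum_depression": "depressive_disorder",
    "depressive_disorder": "mood_disorder",
}


def descendants(root, is_a):
    """Return root plus every concept that reaches root through is-a links."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for child, parent in is_a.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


# Invented records: patient -> recorded ICD-10-CM codes / ontology concepts.
PATIENT_CODES = {"p1": {"F32.9"}, "p2": {"F33.1"}, "p3": {"I10"}, "p4": set()}
PATIENT_CONCEPTS = {
    "p1": {"depressive_disorder"},
    "p2": {"recurrent_depression"},
    "p3": set(),
    "p4": {"postpartum_depression"},
}


def cohort(records, wanted):
    """Patients with at least one recorded item in the wanted set."""
    return {pid for pid, items in records.items() if items & wanted}


# Variant 1: flat code list. Variant 2: one concept expanded via the hierarchy.
code_cohort = cohort(PATIENT_CODES, {"F32.9", "F33.1"})
concept_cohort = cohort(PATIENT_CONCEPTS, descendants("depressive_disorder", IS_A))

overlap = code_cohort & concept_cohort
only_concept = concept_cohort - code_cohort
only_code = code_cohort - concept_cohort

if __name__ == "__main__":
    print("overlap:", sorted(overlap))
    print("matched only by concept variant:", sorted(only_concept))
    print("matched only by code variant:", sorted(only_code))
```

In this toy data the concept variant matches one extra patient because the hierarchy pulls in a narrower concept. That is the broader-definition behaviour the abstract describes as an advantage or a disadvantage depending on the use case.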

    Facilitating and Enhancing Biomedical Knowledge Translation: An in Silico Approach to Patient-centered Pharmacogenomic Outcomes Research

    Current research paradigms such as traditional randomized controlled trials mostly rely on relatively narrow efficacy data, which results in high internal validity but low external validity. Given this, and the need to address many complex real-world healthcare questions in short periods of time, alternative research designs and approaches should be considered in translational research. In silico modeling studies, along with longitudinal observational studies, are considered appropriate and feasible means to address the slow pace of translational research. There is therefore a need for an approach that evaluates newly discovered genetic tests via an in silico enhanced translational research model (iS-TR) to conduct patient-centered outcomes research and comparative effectiveness research (PCOR CER) studies. In this dissertation, it was hypothesized that retrospective EMR analysis, followed by mathematical modeling and simulation-based prediction, could facilitate and accelerate the generation and translation of pharmacogenomic knowledge on the comparative effectiveness of anticoagulation treatment plan(s) tailored to well-defined target populations, eventually decreasing overall adverse risk and improving individual and population outcomes. To test this hypothesis, a simulation modeling framework (iS-TR) was proposed that takes advantage of longitudinal electronic medical records (EMRs) to provide an effective approach to translating pharmacogenomic anticoagulation knowledge and conducting PCOR CER studies. The accuracy of the model was demonstrated by reproducing the outcomes of two major randomized clinical trials on individualizing warfarin dosing. A substantial hospital healthcare use case that demonstrates the value of iS-TR in addressing real-world anticoagulation PCOR CER challenges was also presented.

    Enriching information extraction pipelines in clinical decision support systems

    Programa Oficial de Doutoramento en Tecnoloxías da Información e as Comunicacións (Official Doctoral Programme in Information and Communication Technologies). 5032V01. [Abstract] Multicentre health studies are important for increasing the impact of medical research findings because of the number of subjects they are able to engage. To simplify the execution of these studies, the data-sharing process should be effortless, for instance through the use of interoperable databases. However, achieving this interoperability is still an ongoing research topic, mainly due to data governance and privacy issues. In the first stage of this work, we propose several methodologies to optimise the harmonisation pipelines of health databases. This work focused on harmonising heterogeneous data sources into a standard data schema, namely the OMOP CDM, which has been developed and promoted by the OHDSI community. We validated our proposal using data sets of Alzheimer’s disease patients from distinct institutions. In the following stage, aiming to enrich the information stored in OMOP CDM databases, we investigated solutions to extract clinical concepts from unstructured narratives, using information retrieval and natural language processing techniques. The validation was performed with datasets provided in scientific challenges, namely the National NLP Clinical Challenges (n2c2). In the final stage, we aimed to simplify the protocol execution of multicentre studies by proposing novel solutions for profiling, publishing and facilitating the discovery of databases. Some of the developed solutions are currently being used in three European projects aiming to create federated networks of health databases across Europe.

    Privacy-Enhancing Technologies for Medical and Genomic Data: From Theory to Practice

    The impressive technological advances in genomic analysis and the significant drop in the cost of genome sequencing are paving the way to a variety of revolutionary applications in modern healthcare. In particular, the increasing understanding of the human genome, and of its relation to diseases, to health and to responses to treatments, brings the promise of better preventive and personalized medicine. Unfortunately, the impact on privacy and security is unprecedented. The genome is our ultimate identifier and, if leaked, it can unveil sensitive and personal information such as our genetic diseases, our propensity to develop certain conditions (e.g., cancer or Alzheimer's) or the health issues of our family. Even though legislation, such as the EU General Data Protection Regulation (GDPR) or the US Health Insurance Portability and Accountability Act (HIPAA), aims at mitigating abuses based on genomic and medical data, it is clear that this information also needs to be protected by technical means. In this thesis, we investigate the problem of developing new and practical privacy-enhancing technologies (PETs) for the protection of medical and genomic data. Our goal is to accelerate the adoption of PETs in the medical field in order to address the privacy and security concerns that prevent personalized medicine from reaching its full potential. We focus on two main areas of personalized medicine: clinical care and medical research. For clinical care, we first propose a system for securely storing and selectively retrieving raw genomic data, which is indispensable for in-depth diagnoses and treatments of complex genetic diseases such as cancer. Then, we focus on genetic variants and devise a new model based on additively-homomorphic encryption for privacy-preserving genetic testing in clinics. Our model, implemented in the context of HIV treatment, is the first to be tested and evaluated by practitioners in a real operational setting. 
For medical research, we first propose a method that combines somewhat-homomorphic encryption with differential privacy to enable secure feasibility studies on genetic data stored at an untrusted central repository. Second, we address the problem of sharing genomic and medical data when the data is distributed across multiple mistrustful institutions. We begin by analyzing the risks that threaten patients' privacy in systems for the discovery of genetic variants, and we propose practical mitigations to the re-identification risk. Then, for clinical sites to be able to share the data without worrying about the risk of data breaches, we develop a new system based on collective homomorphic encryption: it achieves trust decentralization and enables researchers to securely find eligible patients for clinical studies. Finally, we design a new framework, complementary to the previous ones, for quantifying the risk of unintended disclosure caused by potential inference attacks combined by a malicious adversary when exact genomic data is shared. In summary, in this thesis we demonstrate that PETs, still believed to be impractical and immature, can be made practical and can become real enablers for overcoming the privacy and security concerns blocking the advancement of personalized medicine. Addressing privacy issues in healthcare remains a great challenge that will increasingly require long-term collaboration among geneticists, healthcare providers, ethicists, lawmakers, and computer scientists.
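
To make the additively homomorphic property concrete, the following Python sketch computes a weighted sum over encrypted genotype values without decrypting them. It rests on assumptions that are not in the source: Paillier is used as a well-known additively homomorphic scheme (the abstract does not name the scheme used in the thesis), the primes are tiny and hard-coded, and the genotypes and weights are made up. It is insecure and for illustration only.

```python
import math
import secrets


def keygen(p=1789, q=1861):
    """Toy Paillier key generation with tiny fixed primes (insecure by design)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid shortcut because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n, with g = n + 1."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover the plaintext using L(x) = (x - 1) // n."""
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def scale(n, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(c, k, n * n)


def encrypted_score(n, enc_genotypes, weights):
    """Weighted sum of encrypted genotypes, computed without the private key."""
    total = encrypt(n, 0)
    for c, w in zip(enc_genotypes, weights):
        total = add(n, total, scale(n, c, w))
    return total


n, priv = keygen()
genotypes = [0, 1, 2, 1]   # hypothetical risk-allele counts per variant
weights = [3, 5, 2, 7]     # hypothetical integer weights held by the tester
enc = [encrypt(n, g) for g in genotypes]
score = decrypt(n, priv, encrypted_score(n, enc, weights))

if __name__ == "__main__":
    print("decrypted score:", score)
```

Only the holder of the private key learns the final score. The party computing `encrypted_score` sees ciphertexts and its own weights, which is the general pattern that makes additive homomorphism attractive for testing on protected genetic variants.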

    Classifying Relations using Recurrent Neural Network with Ontological-Concept Embedding

    Relation extraction and classification is a fundamental and challenging aspect of Natural Language Processing (NLP) research that depends on other tasks such as entity detection and word sense disambiguation. Traditional relation extraction methods based on pattern matching with regular expression grammars and lexico-syntactic pattern rules suffer from several drawbacks, including the labor involved in handcrafting and maintaining a large number of rules that are difficult to reuse. Current research has focused on using neural networks, in particular a specific type of Recurrent Neural Network (RNN), to improve the accuracy of relation extraction tasks. A promising approach for relation classification uses an RNN that incorporates an ontology-based concept embedding layer in addition to word embeddings. This dissertation presents several improvements to this approach by addressing its main limitations. First, several different types of semantic relationships between concepts are incorporated into the model; prior work has only considered is-a hierarchical relationships. Second, a significantly larger vocabulary of concepts is used. Third, an improved method for concept matching was devised. Adding these improvements to two state-of-the-art baseline models improved accuracy when evaluated on benchmark data used in prior studies.

    Natural Language Processing: Emerging Neural Approaches and Applications

    This Special Issue highlights the most recent research being carried out in the NLP field and discusses related open issues, with a particular focus both on emerging approaches for language learning, understanding, production, and grounding, interactively or autonomously from data, in cognitive and neural systems, and on their potential or real applications in different domains.