28 research outputs found
Processamento automático de texto de narrativas clínicas
The informatization of medical systems and the subsequent move towards
the usage of Electronic Health Records (EHR) over the paper format by
medical professionals allowed for safer and more efficient healthcare. Additionally,
EHR can also be used as a data source for observational studies
around the world. However, it is estimated that 70-80% of all clinical data
is in the form of unstructured free text and regarding the data that is structured,
not all of it follows the same standards, making it difficult to use in
such observational studies.
This dissertation aims to tackle those two adversities using natural language
processing for the task of extracting concepts from free text and, afterwards,
using a common data model to harmonize the data. The developed system
employs an annotator, namely cTAKES, to extract the concepts from free
text. The extracted concepts are then normalized using text preprocessing,
word embeddings, MetaMap and UMLS Metathesaurus lookup. Finally, the
normalized concepts are converted to the OMOP Common Data Model and
stored in a database.
In order to test the developed system, the i2b2 2010 data set was used.
The different components of the system were tested and evaluated separately,
with the concept extraction component achieving a precision, recall
and F-score of 77.12%, 70.29% and 73.55%, respectively. The normalization
component was evaluated by completing the N2C2 2019 challenge
track 3, where it achieved a 77.5% accuracy. Finally, during the OMOP
CDM conversion component, it was observed that 7.92% of the concepts
were lost during the process. In conclusion, even though the developed system
still has room for improvement, it proves to be a viable method of
automatically processing clinical narratives.
The computerization of medical systems and the subsequent tendency of
health professionals to replace paper records with electronic health
records have made health services safer and more efficient. In addition,
these electronic records have the benefit of being usable as a data source
for observational studies. However, it is estimated that 70-80% of all
clinical data is in the form of unstructured free text, and the data that
is structured does not all follow the same standards, hindering its
potential use in observational studies.
This dissertation aims to address these two problems by using natural
language processing to extract concepts from free text and then using a
common data model to harmonize them. The developed system uses an
annotator, specifically cTAKES, to extract concepts from free text. The
extracted concepts are then normalized using text preprocessing techniques,
word embeddings, MetaMap and a lookup system over the UMLS Metathesaurus.
Finally, the normalized concepts are converted to the OMOP common data
model and stored in a database.
To test the developed system, the 2010 i2b2 data set was used. The
different parts of the system were tested and evaluated individually, with
concept extraction achieving a precision, recall and F-score of 77.12%,
70.29% and 73.55%, respectively. Normalization was evaluated through track 3
of the N2C2 2019 challenge, where an accuracy of 77.5% was obtained. In the
conversion to the OMOP common data model, 7.92% of the concepts were lost.
It was concluded that, although the developed system still has room for
improvement, it proved to be a viable method for automatically processing
the text of clinical narratives.
Master's in Computer and Telematics Engineering
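The precision, recall and F-score reported for the concept extraction component can be cross-checked with a short calculation. The sketch below assumes the reported F-score is the balanced F1 (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_score` is only illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f_score(77.12, 70.29)
print(f"{f1:.2f}")  # 73.55
```

Under that assumption, the computed value matches the 73.55% reported in the abstract.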
An Interoperability Platform Enabling Reuse of Electronic Health Records for Signal Verification Studies
Depending mostly on voluntarily sent spontaneous reports, pharmacovigilance studies are hampered by low quantity and quality of patient data. Our objective is to improve postmarket safety studies by enabling safety analysts to seamlessly access a wide range of EHR sources for collecting deidentified medical data sets of selected patient populations and tracing the reported incidents back to original EHRs. We have developed an ontological framework where EHR sources and target clinical research systems can continue using their own local data models, interfaces, and terminology systems, while structural and semantic interoperability are handled through rule-based reasoning on formal representations of different models and terminology systems maintained in the SALUS Semantic Resource Set. The SALUS Common Information Model at the core of this set acts as the common mediator. We demonstrate the capabilities of our framework through one of the SALUS safety analysis tools, namely, the Case Series Characterization Tool, which has been deployed on top of the regional EHR data warehouse of the Lombardy Region, containing about 1 billion records from 16 million patients, and validated by several pharmacovigilance researchers with real-life cases. The results confirm significant improvements in signal detection and evaluation compared to traditional methods with the missing background information.
Evaluation of the Privacy Risks of Personal Health Identifiers and Quasi-Identifiers in a Distributed Research Network: Development and Validation Study
Background: Privacy should be protected in medical data that include patient information. A distributed research network (DRN) is one of the challenges in privacy protection and in the encouragement of multi-institutional clinical research. A DRN standardizes multi-institutional data into a common structure and terminology called a common data model (CDM), and it only shares analysis results. It is necessary to measure how a DRN protects patient information privacy even without sharing data in practice.
Objective: This study aimed to quantify the privacy risk of a DRN by comparing different deidentification levels focusing on personal health identifiers (PHIs) and quasi-identifiers (QIs).
Methods: We detected PHIs and QIs in an Observational Medical Outcomes Partnership (OMOP) CDM as threatening privacy, based on 18 Health Insurance Portability and Accountability Act of 1996 (HIPAA) identifiers and previous studies. To compare the privacy risk according to the different privacy policies, we generated limited and safe harbor data sets based on 16 PHIs and 12 QIs as threatening privacy from the Synthetic Public Use File 5 Percent (SynPUF5PCT) data set, which is a public data set of the OMOP CDM. With minimum cell size and equivalence class methods, we measured the privacy risk reduction with a trust differential gap obtained by comparing the two data sets. We also measured the gap in randomly sampled records from the two data sets to adjust the number of PHI or QI records.
Results: The gaps averaged 31.448% and 73.798% for PHIs and QIs, respectively, with a minimum cell size of one, which represents a unique record in a data set. Among PHIs, the national provider identifier had the highest gap of 71.236% (71.244% and 0.007% in the limited and safe harbor data sets, respectively). The maximum size of the equivalence class, which has the largest size of an indistinguishable set of records, averaged 771. In 1000 random samples of PHIs, Device_exposure_start_date had the highest gap of 33.730% (87.705% and 53.975% in the data sets). Among QIs, Death had the highest gap of 99.212% (99.997% and 0.784% in the data sets). In 1000, 10,000, and 100,000 random samples of QIs, Device_treatment had the highest gaps of 12.980% (99.980% and 87.000% in the data sets), 60.118% (99.831% and 39.713%), and 93.597% (98.805% and 5.207%), respectively, and in 1 million random samples, Death had the highest gap of 99.063% (99.998% and 0.934% in the data sets).
Conclusions: In this study, we verified and quantified the privacy risk of PHIs and QIs in the DRN. Although this study used limited PHIs and QIs for verification, the privacy limitations found in this study could be used as a quality measurement index for deidentification of multi-institutional collaboration research, thereby increasing DRN safety.
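The minimum cell size and equivalence class measurements named in the Methods can be illustrated with a small sketch. Everything here is an assumption made for illustration: the records, the column names, and the reading of the "gap" as the difference between the share of at-risk records in the limited and the safe harbor data sets. It is not the study's actual procedure or data.

```python
from collections import Counter


def equivalence_classes(records, columns):
    """Count records that share the same values on the given columns."""
    return Counter(tuple(r[c] for c in columns) for r in records)


def risk_percent(records, columns, min_cell_size=1):
    """Percentage of records in equivalence classes of size <= min_cell_size."""
    if not records:
        return 0.0
    classes = equivalence_classes(records, columns)
    at_risk = sum(size for size in classes.values() if size <= min_cell_size)
    return 100.0 * at_risk / len(records)


# Hypothetical records: full values (limited) versus generalized values
# (safe harbor style: year only, three-digit ZIP prefix).
columns = ["birth_date", "zip"]
limited = [
    {"birth_date": "1950-03-14", "zip": "02139"},
    {"birth_date": "1950-07-02", "zip": "02139"},
    {"birth_date": "1962-11-30", "zip": "02141"},
    {"birth_date": "1962-01-09", "zip": "02141"},
]
safe_harbor = [
    {"birth_date": "1950", "zip": "021"},
    {"birth_date": "1950", "zip": "021"},
    {"birth_date": "1962", "zip": "021"},
    {"birth_date": "1962", "zip": "021"},
]

# Assumed definition: gap = risk in limited set minus risk in safe harbor set.
gap = risk_percent(limited, columns) - risk_percent(safe_harbor, columns)
largest_class = max(equivalence_classes(safe_harbor, columns).values())
print(gap, largest_class)
```

With a minimum cell size of one, every record in the toy limited set is unique, while generalization in the toy safe harbor set leaves no unique records, so the gap in this example is the full 100 percentage points.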
Postmarketing Safety Study Tool: A Web Based, Dynamic, and Interoperable System for Postmarketing Drug Surveillance Studies
Postmarketing drug surveillance is a crucial aspect of the clinical research activities in pharmacovigilance and pharmacoepidemiology. Successful utilization of available Electronic Health Record (EHR) data can complement and strengthen postmarketing safety studies. In terms of the secondary use of EHRs, access and analysis of patient data across different domains are a critical factor; we address this data interoperability problem between EHR systems and clinical research systems in this paper. We demonstrate that this problem can be solved in an upper level with the use of common data elements in a standardized fashion so that clinical researchers can work with different EHR systems independently of the underlying information model. Postmarketing Safety Study Tool lets the clinical researchers extract data from different EHR systems by designing data collection set schemas through common data elements. The tool interacts with a semantic metadata registry through IHE data element exchange profile. Postmarketing Safety Study Tool and its supporting components have been implemented and deployed on the central data warehouse of the Lombardy region, Italy, which contains anonymized records of about 16 million patients with over 10-year longitudinal data on average. Clinical researchers in Roche validate the tool with real life use cases.
A model not a prophet: Operationalising patient-level prediction using observational data networks
Improving prediction model development and evaluation processes using observational health data
Doctor of Philosophy
Clinical research plays a vital role in producing knowledge valuable for understanding human disease and improving healthcare quality. Human subject protection is an obligation essential to the clinical research endeavor, much of which is governed by federal regulations and rules. Institutional Review Boards (IRBs) are responsible for overseeing human subject research to protect individuals from harm and to preserve their rights. Researchers are required to submit and maintain an IRB application, which is an important component in the clinical research process that can significantly affect the timeliness and ethical quality of the study. As clinical research has expanded in both volume and scope over recent years, IRBs are facing increasing challenges in providing efficient and effective oversight. The Clinical Research Informatics (CRI) domain has made significant efforts to support various aspects of clinical research through developing information systems and standards. However, information technology use by IRBs has not received much attention from the CRI community. This dissertation project analyzed over 100 IRB application systems currently used at major academic institutions in the United States. The varieties of system types and lack of standardized application forms across institutions are discussed in detail. The need for building an IRB domain analysis model is identified. In this dissertation, I developed an IRB domain analysis model with a special focus on promoting interoperability among CRI systems to streamline the clinical research workflow. The model was evaluated by a comparison with five real-world IRB application systems. Finally, a prototype implementation of the model was demonstrated by the integration of an electronic IRB system with a health data query system. This dissertation project fills a gap in the research of information technology use for the IRB oversight domain.
Adoption of the IRB domain analysis model has potential to enhance efficient and high-quality ethics oversight and to streamline the clinical research workflow.
The IeDEA Harmonist Data Toolkit: A Data Quality and Data Sharing Solution for a Global HIV Research Consortium.
We describe the design, implementation, and impact of a data harmonization, data quality checking, and dynamic report generation application in an international observational HIV research network. The IeDEA Harmonist Data Toolkit is a web-based application written in the open-source programming language R; it employs the R/Shiny and RMarkdown packages and leverages the REDCap data collection platform for data model definition and user authentication. The Toolkit performs data quality checks on uploaded datasets, checks for conformance with the network's common data model, displays the results both interactively and in downloadable reports, and stores approved datasets in secure cloud storage for retrieval by the requesting investigator. Including stakeholders and users in the design process was key to the successful adoption of the application. A survey of regional data managers as well as initial usage metrics indicate that the Toolkit saves time and results in improved data quality, with a 61% mean reduction in the number of error records in a dataset. The generalized application design allows the Toolkit to be easily adapted to other research networks.
Arquiteturas federadas para integração de dados biomédicos
Doctorate in Computer Science
The last decades have been characterized by a continuous adoption of
IT solutions in the healthcare sector, which resulted in the proliferation
of tremendous amounts of data over heterogeneous systems. Distinct
data types are currently generated, manipulated, and stored, in the
several institutions where patients are treated. Data sharing and
integrated access to this information will allow extracting relevant
knowledge that can lead to better diagnostics and treatments.
This thesis proposes new integration models for gathering information
and extracting knowledge from multiple and heterogeneous biomedical
sources.
The complexity of the scenario led us to split the integration problem according
to the data type and to the usage specificity. The first contribution is a
cloud-based architecture for exchanging medical imaging services. It
offers a simplified registration mechanism for providers and services,
promotes remote data access, and facilitates the integration of
distributed data sources. Moreover, it is compliant with international
standards, ensuring the platform interoperability with current medical
imaging devices. The second proposal is a sensor-based architecture
for integration of electronic health records. It follows a federated
integration model and aims to provide a scalable solution to search and
retrieve data from multiple information systems. The last contribution is
an open architecture for gathering patient-level data from dispersed and
heterogeneous databases. All the proposed solutions were deployed
and validated in real-world use cases.
The steady adoption of information and communication technologies in
healthcare has enabled an increase in the diversity and quality of the
services provided but, at the same time, has generated an enormous
amount of data whose scientific value is yet to be explored. Sharing
and integrated access to this information may enable new discoveries
that can lead to better diagnoses and better clinical treatments.
This thesis proposes new models for data integration and exploration,
with a view to extracting biomedical knowledge from multiple data
sources.
The first contribution is a cloud-based architecture for sharing
medical imaging services. This solution offers a simplified
registration mechanism for providers and services, enabling remote
access and facilitating the integration of different data sources. The
second proposal is a sensor-based architecture for integrating
electronic patient records. This strategy follows a federated
integration model and aims to provide a scalable solution that allows
searching across multiple information systems. Finally, the third
contribution is an open system for making patient data available in a
European context. All the solutions were implemented and validated in
real-world scenarios.