Pseudo-data Generation For Improving Clinical Named Entity Recognition
One of the primary challenges for clinical Named Entity Recognition (NER) is the availability of annotated training data. Technical and legal hurdles prevent the creation and release of corpora related to electronic health records (EHRs). In this work, we look at the impact of pseudo-data generation on clinical NER using gazetteering and thresholding with a neural network model. We report that gazetteers can result in the inclusion of proper terms with the exclusion of determiners and pronouns in preceding and middle positions. Gazetteers that shared higher numbers of terms with the original dataset had a higher impact. We also report that thresholding yields clear trend lines across the thresholds, with some values oscillating around a fixed point at the highest-confidence points.
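The thresholding step, in which entity predictions whose model confidence falls below a cutoff are replaced by the non-entity label, can be sketched as follows. This is a purely illustrative example; the function and label names are hypothetical and not from the paper.

```python
def threshold_entities(predictions, threshold):
    """Keep predicted entities whose confidence meets the cutoff.

    predictions: list of (token, label, confidence) triples from a
    hypothetical NER model; anything below the threshold falls back
    to the non-entity label 'O'.
    """
    kept = []
    for token, label, confidence in predictions:
        if confidence >= threshold:
            kept.append((token, label))
        else:
            kept.append((token, "O"))
    return kept

# Toy predictions; sweeping the threshold trades recall for precision.
preds = [("aspirin", "B-DRUG", 0.97), ("daily", "O", 0.99),
         ("mg", "B-DOSE", 0.42)]
print(threshold_entities(preds, 0.5))
```

Plotting a metric against a sweep of thresholds produces the kind of trend lines the abstract describes.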
Machine learning approaches to identifying social determinants of health in electronic health record clinical notes
Social determinants of health (SDH) represent the complex set of circumstances into which individuals are born, or with which they live, that impact health. Relatively little attention has been given to the processes needed to extract SDH data from electronic health records (EHRs). Despite their importance, SDH data in the EHR remain sparse, typically collected only in clinical notes and thus largely unavailable for clinical decision making. I focus on developing and validating more efficient information extraction approaches to identifying and classifying SDH in clinical notes. In this dissertation, I have three goals. First, I develop a word embedding model to expand SDH terminology in the context of identifying SDH in clinical text. Second, I examine the effectiveness of different machine learning algorithms and a neural network model in classifying the SDH characteristics of financial resource strain and poor social support. Third, I compare the highest-performing approaches to simpler text mining techniques, and evaluate the models on performance, cost, and generalizability in the task of classifying SDH in two distinct data sources.
Doctor of Philosophy
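Terminology expansion with a word embedding model typically ranks vocabulary terms by vector similarity to a seed term. A minimal sketch of that idea, with toy three-dimensional vectors standing in for a trained embedding model (all names and numbers here are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_terms(seed, vectors, top_n=2):
    """Rank vocabulary terms by similarity to a seed SDH term."""
    seed_vec = vectors[seed]
    scored = [(term, cosine(seed_vec, vec))
              for term, vec in vectors.items() if term != seed]
    scored.sort(key=lambda tv: tv[1], reverse=True)
    return [term for term, _ in scored[:top_n]]

# Toy vectors: terms used in similar contexts get similar vectors.
vectors = {
    "homeless": [0.9, 0.1, 0.0],
    "unhoused": [0.88, 0.15, 0.02],
    "shelter":  [0.7, 0.3, 0.1],
    "aspirin":  [0.0, 0.1, 0.95],
}
print(expand_terms("homeless", vectors))
```

A trained model (e.g., word2vec over clinical notes) would supply the vectors; the ranking logic is the same.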
Health systems data interoperability and implementation
Objective The objective of this study was to use machine learning and health standards to address the problem of clinical data interoperability across healthcare institutions. Addressing this problem has the potential to make clinical data comparable, searchable and exchangeable between healthcare providers.
Data sources Structured and unstructured data were used to conduct the experiments in this study. The data were collected from two disparate data sources, namely MIMIC-III and NHANES. The MIMIC-III database stores data from two electronic health record systems, CareVue and MetaVision. The data stored in these systems were not recorded to the same standards and were therefore not comparable: some values conflicted, one system would store an abbreviation of a clinical concept while the other stored the full concept name, and some attributes contained missing information. These issues make this form of data a good candidate for this study. From the identified data sources, laboratory, physical examination, vital signs, and behavioural data were used.
Methods This research employed the CRISP-DM framework as a guideline for all the stages of data mining. Two sets of classification experiments were conducted: one for the classification of structured data and the other for unstructured data. For the first experiment, edit distance, TF-IDF, and Jaro-Winkler were used to calculate similarity weights between two datasets, one coded with the LOINC terminology standard and the other not coded. Similar sets of data were classified as matches, while dissimilar sets were classified as non-matches. The Soundex indexing method was then used to reduce the number of potential comparisons. Thereafter, three classification algorithms were trained and tested, and the performance of each was evaluated through the ROC curve. The second experiment aimed to extract patients' smoking status from a clinical corpus. A sequence-oriented classification algorithm, the conditional random field (CRF), was used for learning related concepts from the given clinical corpus, with word embedding, random indexing, and word-shape features used to capture the meaning in the corpus.
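The blocking-plus-similarity setup described above (Soundex indexing to cut down the number of candidate pairs, then a string distance to score the survivors) can be illustrated with generic implementations of the classic algorithms; this is a sketch, not the thesis's code:

```python
def soundex(name):
    """Classic four-character Soundex code, used here for blocking:
    only records sharing a code are compared in full."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    out, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":   # h and w do not break runs of equal codes
            prev = code
    return (first + "".join(out) + "000")[:4]

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (one-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Candidate pairs are only scored when their Soundex codes match.
print(soundex("hemoglobin"), soundex("haemoglobin"))
print(edit_distance("hemoglobin", "haemoglobin"))
```

Both spellings block to the same code, so the pair survives to the (more expensive) edit-distance comparison.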
Results Having optimized all the model's parameters through v-fold cross-validation on a sampled training set of structured data, only 8 of the 24 features were selected for the classification task. RapidMiner was used to train and test all the classification algorithms. On the final run of the classification process, the last contenders were the SVM and the decision tree classifier. The SVM yielded an accuracy of 92.5% with its parameters tuned. These results were obtained after more relevant features were identified, it having been observed that the classifiers were biased on the initial data. The unstructured data, in turn, were annotated via the UIMA Ruta scripting language and then trained through CRFSuite, which comes with the CLAMP toolkit. The CRF classifier obtained an F-measure of 94.8% for the "nonsmoker" class, 83.0% for "currentsmoker", and 65.7% for "pastsmoker". It was observed that as more relevant data were added, the performance of the classifier improved. The results point to a need for FHIR resources for exchanging clinical data between healthcare institutions. FHIR is free, and it uses profiles to extend coding standards, a RESTful API to exchange messages, and JSON, XML, and Turtle for representing messages. Data could be stored in JSON format in a NoSQL database such as CouchDB, which makes them available for further post-extraction exploration.
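To make the FHIR-plus-CouchDB suggestion concrete, a pared-down example of a lab result represented as a FHIR-style Observation in JSON follows. This fragment is illustrative only: a real FHIR Observation carries additional required elements (subject, effective time, identifiers), and the numeric value shown is invented.

```python
import json

# A pared-down, FHIR-style Observation. 718-7 is the LOINC code for
# "Hemoglobin [Mass/volume] in Blood"; the measured value is invented.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{"system": "http://loinc.org",
                    "code": "718-7",
                    "display": "Hemoglobin [Mass/volume] in Blood"}]
    },
    "valueQuantity": {"value": 13.2, "unit": "g/dL"},
}

# Serialised as JSON, the record is ready for a document database
# such as CouchDB.
payload = json.dumps(observation)
print(json.loads(payload)["code"]["coding"][0]["code"])
```

Because both systems' records would carry the same LOINC coding, values from CareVue and MetaVision become directly comparable once mapped into such resources.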
Conclusion This study has provided a method for a computer algorithm to learn a clinical coding standard and then apply that learned standard to unstandardized data, so that the data become easily exchangeable, comparable, and searchable, ultimately achieving data interoperability. Even though this study was applied on a limited scale, future work would explore the standardization of patients' long-lived data from multiple sources using the SHARPn open-source tools and data-scaling platforms.
Information Science, M.Sc. (Computing)
Improving patient record search
Improving health search is a broad area concerning the effectiveness of Information Retrieval (IR) systems (also called search engines) while providing grounds for the creation of reliable test collections. In this research we analyse IR and text-processing methods to improve health search, particularly search over Electronic Patient Records (EPRs). We also propose a novel approach to evaluating IR systems that, unlike traditional IR evaluation, does not rely on human relevance judgement. We find that our metadata-based method is more effective than query expansion using external knowledge sources, and that our simulated relevance judgements correlate positively with human-made relevance judgements.
Record linkage of population-based cohort data from minors with national register data : A scoping review and comparative legal analysis of four European countries
Funding Information: We would like to acknowledge Evert-Ben van Veen from the MLC Foundation, Dagelijkse Groenmarkt 2, 2513 AL Den Haag, the Netherlands; the country-specific text on the Netherlands was based on his contribution. Publisher Copyright: © 2021 Doetsch JN et al.
Background: The GDPR was implemented to build an overarching framework for personal data protection across the EU/EEA. Linkage of data directly collected from cohort participants, potentially serving as a prominent tool for health research, must respect data protection rules and privacy rights. Our objective was to investigate the legal possibilities of linking cohort data of minors with routinely collected education and health data, comparing EU/EEA member states. Methods: A legal comparative analysis and scoping review were conducted of openly accessible laws and regulations, published in EUR-Lex and national law databases, on the GDPR's implementation in Portugal, Finland, Norway, and the Netherlands and its connected national regulations concerning record linkage for health research, implemented up until April 30, 2021. Results: The GDPR does not ensure total uniformity in data protection legislation across member states, offering flexibility for national legislation. Exceptions to process personal data, e.g., for public interest and scientific research, must be laid down in EU/EEA or national law. Differences in national interpretation caused obstacles to cross-national research and record linkage: Portugal requires written consent and ethical approval; Finland allows linkage mostly without consent through the national Social and Health Data Permit Authority; Norway permits linkage when based on a regional ethics committee's approval and adequate information technology safeguarding confidentiality; the Netherlands mainly bases linkage on an opt-out system and a Data Protection Impact Assessment.
Conclusions: Though the GDPR is the most important legal framework, the execution of national legislation matters most when linking cohort data with routinely collected health and education data. As national interpretation varies, legal intervention balancing the individual right to informational self-determination against the public good is gravely needed for health research. More harmonization across the EU/EEA could be helpful, but should not be detrimental in those member states that have already opened a leeway for registries and research for the public good without explicit consent.
Peer reviewed
Repeatable and reusable research - Exploring the needs of users for a Data Portal for Disease Phenotyping
Background: Big data research in the field of health sciences is hindered by a lack of agreement on how to identify and define different conditions and their medications. This means that researchers and health professionals often have different phenotype definitions for the same condition. This lack of agreement makes it hard to compare different study findings and hinders the ability to conduct repeatable and reusable research. Objective: This thesis aims to examine the requirements of various users, such as researchers, clinicians, machine learning experts, and managers, for both new and existing data portals for phenotypes (concept libraries). Methods: Exploratory sequential mixed methods were used in this thesis to examine which concept libraries are available, how they are used, what their characteristics are, where the gaps lie, and what needs to be done in the future from the point of view of the people who use them. This thesis consists of three phases: 1) two qualitative studies, including one-to-one interviews with researchers, clinicians, machine learning experts, and senior research managers in health data science, as well as focus group discussions with researchers working with the Secure Anonymised Information Linkage (SAIL) Databank; 2) the creation of an email survey (i.e., the Concept Library Usability Scale); and 3) a quantitative study with researchers, health professionals, and clinicians. Results: Most of the participants thought that the prototype concept library would be a very helpful resource for conducting repeatable research, but they specified many requirements that would need to be met before its development. Although all the participants stated that they were aware of some existing concept libraries, most expressed negative perceptions of them.
The participants mentioned several facilitators that would encourage them to: 1) share their work, such as receiving citations from other researchers; and 2) reuse the work of others, such as saving the considerable time and effort they frequently spend creating new code lists from scratch. They also pointed out several barriers that could inhibit them from: 1) sharing their work, such as concerns about intellectual property (e.g., that if they shared their methods before publication, other researchers would use them as their own); and 2) reusing others' work, such as a lack of confidence in the quality and validity of others' code lists. Participants suggested some developments they would like to see in order to make research done with routine data more reproducible, such as a drive for more transparency in research methods documentation, including publishing complete phenotype definitions and clear code lists. Conclusions: The findings of this thesis indicate that most participants valued a concept library for phenotypes. However, only half of the participants felt that they would contribute by providing definitions for the concept library, and they reported many barriers to sharing their work on a publicly accessible platform such as the CALIBER research platform. Analysis of the interviews, focus group discussions, and qualitative studies revealed that different users have different requirements, facilitators, barriers, and concerns regarding concept libraries. This work also investigated whether concept libraries should be developed in Kuwait to facilitate improved data sharing. However, the thesis concludes that this would be unlikely to be cost-effective or highly valued by users, and that investment in open-access research publications may be of more value to the Kuwaiti research and academic community.
Novel reversible text data de-identification techniques based on native data structures
Technological development in today's digital world has resulted in the collection and storage of large amounts of personal data. These data enable both direct services and non-direct activities, known as secondary use. The secondary use of data can improve decision-making, service experiences, and healthcare systems. However, the widespread reuse of personal data raises significant privacy and policy issues, especially for health-related information; these data may contain sensitive data, leading to privacy breaches if compromised. Legal systems establish laws to protect the privacy of personal data disclosed for secondary use. A well-known example is the General Data Protection Regulation (GDPR), which outlines a specific set of rules for sharing and storing personal data to protect individual privacy. The GDPR explicitly points to data de-identification, especially pseudonymization, as one measure that can help meet the requirements for the processing of personal data.
The literature on privacy preservation approaches has largely been developed in the field of data anonymization, where personal data are irreversibly removed or obfuscated and there is no means by which to recover an individual's identity if needed. By contrast, pseudonymization is a promising technique to protect privacy while enabling the recovery of de-identified data. Significantly, many existing approaches for pseudonymization were developed long before the GDPR requirements were established, and so they may fail to satisfy its provisions. Therefore, it is worthwhile to offer technical solutions to preserve privacy while supporting the legitimate use of data.
This thesis proposes a novel de-identification system for unstructured textual data, known as ARTPHIL, that generates de-identified data in compliance with the GDPR requirement for strong pseudonymization. The system was evaluated using the 2014 i2b2 test data. The proposed system achieved a recall of 96.93% in detecting and encrypting personal health information, as specified under guidelines provided by the Health Insurance Portability and Accountability Act (HIPAA). The system used a novel, lightweight cryptography algorithm, E-ART, to encrypt personal data cost-effectively and without compromising security. The main novelty of the E-ART algorithm is its use of the reflection property of a balanced binary tree data structure as a substitution method, instead of complex, multiple iterations. The performance and security of the proposed algorithm were compared to two symmetric encryption algorithms: the Advanced Encryption Standard (AES) and the Data Encryption Standard (DES). The security analysis showed comparable results, but the performance analysis indicated that E-ART had the shortest ciphertext and running time with comparable memory usage, which indicates the feasibility of using ARTPHIL for delay-sensitive or data-intensive applications.
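The published E-ART algorithm is not reproduced here, but the general idea of a "reflection" substitution, mapping each symbol to its mirror position as if reflecting the leaves of a balanced tree, can be illustrated with a toy, self-inverse substitution over a sorted alphabet. This sketch reduces to the classical Atbash cipher; it offers no security and only illustrates the reversibility property.

```python
def mirror_substitute(text, alphabet):
    """Toy reversible substitution: each symbol maps to its mirror
    position in a sorted alphabet (self-inverse, like reflecting the
    leaves of a balanced tree). NOT the published E-ART cipher."""
    table = {ch: alphabet[len(alphabet) - 1 - i]
             for i, ch in enumerate(alphabet)}
    return "".join(table.get(ch, ch) for ch in text)

alphabet = "abcdefghijklmnopqrstuvwxyz"
cipher = mirror_substitute("patient", alphabet)
print(cipher)                               # mirrored text
print(mirror_substitute(cipher, alphabet))  # round-trips to "patient"
```

The self-inverse property is what makes such a substitution usable for pseudonymization: applying the same transformation again recovers the original identifier.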
PandeMedia: an annotated corpus of digital media for issue salience
Master's thesis, Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências.
The ubiquitous sharing of information via the Internet has shifted much of society's communication and information-seeking to digital spaces, such as news websites and social networks. As the Web
represents a massive hub of information dissemination and discussion, it has also made possible the extraction of great amounts of highly detailed data to answer complex questions on human behaviour and
preferences.
This shift towards online life was exaggerated during the earlier phases of the COVID-19 pandemic,
when many countries were in lockdown and in-person contact was severely limited. Therefore, in addition
to the ongoing political, economic, and public health crisis, there were, on the one hand, new opportunities
to study human behaviour through digital data, including support for public health measures or trust in
science, while, on the other hand, the deluge of new data and the fast-changing nature of the pandemic
created new challenges to data science research, particularly the need to build quality pipelines for data
extraction, collection, and future analysis.
In this thesis, we focus on the important issue of salience of science and scientists during a health
crisis and ask how to build a pipeline to select, store, extract, and analyse longitudinal digital media data
that might allow for long-term study of media effects on salience. Therefore, this project has two main
components: first, we showcase a data pipeline that makes use of media and social media data, available
online, to build a media corpus of news and tweets with millions of documents, spanning billions of
tokens, corresponding to more than two years of coverage and multiple media sources and topics; second,
we show how this corpus can be leveraged to study the problem of salience, and use the visibility of
science during the earlier phases of the COVID-19 pandemic as a case-study, comparing between salience
in traditional versus social media. Overall, we present both a transparent and scalable pipeline and a
specific application of this approach, to tackle the question of how science visibility changed during this
massive crisis. We use different media types and sources to enable text mining and other analytical
purposes, offering a digital data-centric computational methodology to investigate questions in the social
sciences.

Data now plays a central role in the functioning of human societies. With the development of digital technologies, combined with ubiquitous connectivity to the Internet, in particular to the World Wide Web (WWW), we live in the so-called "information age". This societal paradigm rests on the phenomenon typically referred to as datafication: the now-ingrained process, inherent to everyday life, by which our human activity and forms of participation in society are converted into data. This large-scale, real-time production of data serves as fuel for a broad range of applications across the most varied domains, from industry to scientific research to healthcare, among others. We thus witness a growing demand, and even need, for large data collections to feed the different sectors of activity.

The Web represents perhaps the largest volume of data widely available to the general public. It is on websites and in online applications that a large part of the population carries out a set of daily tasks and actions, whether professional or recreational. Our information-consumption habits are served predominantly by these digital spaces, such as social networks or the digital platforms of traditional media. Likewise, our social interactions mediated by digital devices are increasingly frequent. The Web is therefore a reservoir of potential discoveries and valuable information, which may eventually be extracted by exploring the data it contains.

By its very nature, the Web raises great challenges as to how to capture this value present in digital data. Enormous volumes of data can be quickly and easily identified and extracted. However, no value is added to these data without a prior organization stage. To extract knowledge from the data obtained, the data must exhibit adequate organization and quality. The greatest difficulties in digital data collection and management methodologies lie precisely in ensuring this quality. Web data are naturally very heterogeneous, since they result from the convergence of countless information sources. They are also mostly unstructured, namely in textual formats that need to be computationally interpreted and compartmentalized to facilitate future analysis. Often there are also missing data, or data of such low quality that they are unusable for the purposes in mind. Beyond these factors intrinsic to the data themselves, the issues surrounding them are also crucial to consider: the ability to detect and locate the desired data, the ability to access these data, and the degree of availability of these data when accessible. Ethical, privacy, and copyright questions associated with the data that can be collected must also be taken into consideration. ... automating collection processes for sources and types of data as diverse as those available on the Web.

The pandemic caused by SARS-CoV-2, the agent of COVID-19, represents a crisis of enormous proportions in the political, economic, and public-health spheres. With the world's population restricted in its behaviours and habits so as to prevent a worsening of the virus's spread, people turned to digital means for communication and for obtaining and disseminating information (and misinformation). The media and social networks thus became relevant points of convergence for a great share of public attention, raising important questions about the public perception of scientific experts and about the salience of certain topics of discussion.

In a broader context, we can view the pandemic crisis as a challenge in the domain of information technologies. As this public-health emergency unfolded, we have been confronted with several of the challenges present in data science: complex data, at the scale of entire populations, produced in real time by multiple sources, with different structures and formats, which become outdated quickly and require rapid analysis, but also robust cleaning and enhancement processes.

All these factors lead to our main question: in a crisis evolving as rapidly as the COVID-19 pandemic, how can we build a pipeline that lets us answer the challenges of data collection and management, so as to create a digital media dataset for analysis?

To extract the necessary data, we drew on three distinct sources: the open-source platform Media Cloud, the Internet Archive database, and the API of the social network Twitter. We began by defining eighteen distinct topics, made up of keywords for use in searching for media articles and posts. Some topics are related to the pandemic, while others serve as potential positive and negative controls. The semantic cohesion of each topic was ensured through the lexical database WordNet, which provides word meanings and relations. The metadata initially obtained were processed and used to identify the primary sources of the news data. Through Web scraping, we obtained raw data from United States media articles available online, from January 2019 to January 2021 (inclusive). These were subsequently transformed through a process of filtering, cleaning, and formatting, accompanied by exploratory data analysis and data visualization for diagnosing the complete process. The social network data were extracted through a dedicated API, specifying parameters to restrict results to the United States and to the previously defined time interval. The duly processed data were then stored in a database designed and built for the purpose. The database was conceived with four tables, holding the news data, the Twitter data, the metadata of the original searches, and metadata about the news sources, and was implemented with the PostgreSQL database management system. To optimize query performance over our dataset, we built indexes for specific fields, namely the text fields that are our main interest. Using the available functionality, vector representations of the news text were built, and from these an index suited to searching textual data was constructed, reducing query time by a factor in the tens of thousands. We further demonstrate preliminary querying of the longitudinal data for studying the salience of different topics in the media. Different statistical time-series analysis methodologies were applied to answer the questions at hand. Through moving averages, the signals were clarified for better visualization. Stationarity tests served as diagnostics for the transformations to apply to the data so as to guarantee the validity of subsequent analyses. With Granger causality tests, it was possible to establish relations between time series based on predictive power and thus understand the interaction dynamics of different media. Using change-point detection techniques, we were able to defend the idea that there were periods of change in the patterns observed in the media coinciding with the onset of the pandemic crisis.

Thus, powered by a customized, robust, and transparent pipeline, we were able to generate a media corpus containing millions of documents, holding billions of tokens, corresponding to a period of more than two years and multiple news sources and topics, thereby enabling text mining and other analytical purposes, and offering a digital data-centric computational methodology to investigate this kind of question in the social sciences.
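The moving-average smoothing applied to the daily salience signals can be sketched generically; the data below are toy counts, and this is not the thesis's code:

```python
def moving_average(series, window):
    """Simple trailing moving average used to smooth a daily
    salience count before plotting or change-point inspection."""
    if window < 1 or window > len(series):
        raise ValueError("window must be in [1, len(series)]")
    out = []
    total = sum(series[:window])
    out.append(total / window)
    # Slide the window: add the new point, drop the oldest one.
    for i in range(window, len(series)):
        total += series[i] - series[i - window]
        out.append(total / window)
    return out

daily_mentions = [3, 5, 4, 10, 12, 11, 30]  # toy daily counts
print(moving_average(daily_mentions, 3))
```

Smoothed series like this are then the input to the stationarity, Granger-causality, and change-point analyses described above.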