
    Pseudo-data Generation For Improving Clinical Named Entity Recognition

    One of the primary challenges for clinical Named Entity Recognition (NER) is the availability of annotated training data. Technical and legal hurdles prevent the creation and release of corpora related to electronic health records (EHRs). In this work, we examine the impact of pseudo-data generation on clinical NER, using gazetteering and thresholding with a neural network model. We report that gazetteers can result in the inclusion of proper terms and the exclusion of determiners and pronouns in preceding and middle positions. Gazetteers containing more terms from the original dataset had a higher impact. We also report that thresholding produces clear trend lines across the thresholds, with some values oscillating around a fixed point at the highest-confidence points.
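    The abstract does not spell out how the pseudo-data are generated. As a rough illustration of one common gazetteer-based scheme, the sketch below swaps gold entity spans in an annotated sentence for gazetteer terms to produce extra pseudo-annotated training sentences; the function name, entity label, and toy gazetteer are illustrative assumptions, not taken from the paper.

```python
import random

# Toy gazetteer of clinical problem terms (illustrative only).
GAZETTEER = ["atrial fibrillation", "type 2 diabetes", "chronic kidney disease"]

def make_pseudo_sentences(tokens, tags, gazetteer, n=3, seed=0):
    """Replace each gold entity span with a random gazetteer term,
    keeping BIO tags consistent, to create pseudo-annotated sentences."""
    rng = random.Random(seed)
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))

    pseudo = []
    for _ in range(n):
        new_tokens, new_tags, prev_end = [], [], 0
        for s, e in spans:
            new_tokens += tokens[prev_end:s]
            new_tags += tags[prev_end:s]
            term = rng.choice(gazetteer).split()
            new_tokens += term
            new_tags += ["B-PROBLEM"] + ["I-PROBLEM"] * (len(term) - 1)
            prev_end = e
        new_tokens += tokens[prev_end:]
        new_tags += tags[prev_end:]
        pseudo.append((new_tokens, new_tags))
    return pseudo

sentence = ["Patient", "denies", "chest", "pain", "today", "."]
labels = ["O", "O", "B-PROBLEM", "I-PROBLEM", "O", "O"]
for toks, tgs in make_pseudo_sentences(sentence, labels, GAZETTEER):
    print(list(zip(toks, tgs)))
```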

    Machine learning approaches to identifying social determinants of health in electronic health record clinical notes

    Social determinants of health (SDH) represent the complex set of circumstances in which individuals are born, or with which they live, that impact health. Relatively little attention has been given to the processes needed to extract SDH data from electronic health records (EHRs). Despite their importance, SDH data in the EHR remain sparse, typically recorded only in clinical notes and thus largely unavailable for clinical decision making. I focus on developing and validating more efficient information extraction approaches to identifying and classifying SDH in clinical notes. In this dissertation, I have three goals. First, I develop a word embedding model to expand SDH terminology in the context of identifying SDH in clinical text. Second, I examine the effectiveness of different machine learning algorithms and a neural network model in classifying the SDH characteristics of financial resource strain and poor social support. Third, I compare the highest-performing approaches to simpler text mining techniques and evaluate the models based on performance, cost, and generalizability in the task of classifying SDH in two distinct data sources. Doctor of Philosophy
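    The dissertation abstract mentions a word embedding model for expanding SDH terminology but gives no implementation details. Below is a minimal sketch, assuming the gensim library (version 4 or later) and a tiny made-up note corpus, of how embedding nearest neighbours can suggest candidate terms for an SDH lexicon; the seed terms and corpus are illustrative only.

```python
from gensim.models import Word2Vec  # assumes gensim >= 4.0

# Tiny illustrative corpus of tokenised clinical-note sentences.
notes = [
    ["patient", "reports", "financial", "strain", "and", "food", "insecurity"],
    ["lives", "alone", "with", "poor", "social", "support"],
    ["unable", "to", "afford", "medications", "due", "to", "financial", "hardship"],
    ["homeless", "and", "unemployed", "with", "limited", "social", "support"],
]

# Train a small embedding model; real use would train on a large note corpus.
model = Word2Vec(notes, vector_size=50, window=3, min_count=1, epochs=50, seed=1)

# Expand a seed SDH lexicon with nearest neighbours in embedding space.
seed_terms = ["financial", "support"]
expanded = {t: [w for w, _ in model.wv.most_similar(t, topn=3)] for t in seed_terms}
print(expanded)
```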

    Health systems data interoperability and implementation

    Objective: The objective of this study was to use machine learning and health standards to address the problem of clinical data interoperability across healthcare institutions. Addressing this problem has the potential to make clinical data comparable, searchable, and exchangeable between healthcare providers. Data sources: Structured and unstructured data were used to conduct the experiments in this study. The data were collected from two disparate sources, MIMIC-III and NHANES. The MIMIC-III database stores data from two electronic health record systems, CareVue and MetaVision. The data stored in these systems were not recorded with the same standards and were therefore not comparable: some values conflicted, one system would store an abbreviation of a clinical concept while the other stored the full concept name, and some attributes contained missing information. These issues make this data a good candidate for this study. From the identified sources, laboratory, physical examination, vital signs, and behavioural data were used. Methods: This research employed the CRISP-DM framework as a guideline for all stages of data mining. Two sets of classification experiments were conducted, one for structured data and the other for unstructured data. In the first experiment, edit distance, TF-IDF, and Jaro-Winkler were used to calculate similarity weights between two datasets, one coded with the LOINC terminology standard and one not coded. Similar record pairs were classified as matches, while dissimilar pairs were classified as non-matches. The Soundex indexing method was then used to reduce the number of potential comparisons. Thereafter, three classification algorithms were trained and tested, and the performance of each was evaluated with ROC curves. The second experiment aimed to extract patients' smoking status from a clinical corpus. A sequence classification algorithm, conditional random fields (CRF), was used to learn the related concepts from the corpus, with word embedding, random indexing, and word-shape features used to capture meaning. Results: After optimizing all model parameters through v-fold cross-validation on a sampled training set of structured data, only 8 of 24 features were selected for the classification task. RapidMiner was used to train and test all classification algorithms. In the final run of the classification process, the last contenders were SVM and the decision tree classifier. SVM yielded an accuracy of 92.5% with its tuned parameter settings. These results were obtained after more relevant features were identified, having observed that the classifiers were biased on the initial data. The unstructured data were annotated with the UIMA Ruta scripting language and a model was then trained with CRFSuite, which comes with the CLAMP toolkit. The CRF classifier obtained an F-measure of 94.8% for the "nonsmoker" class, 83.0% for "currentsmoker", and 65.7% for "pastsmoker". It was observed that as more relevant data were added, the performance of the classifier improved. The results show that there is a need for FHIR resources when exchanging clinical data between healthcare institutions. FHIR is free and uses profiles to extend coding standards, a RESTful API to exchange messages, and JSON, XML, and Turtle to represent messages.
Data could be stored in JSON format in a NoSQL database such as CouchDB, which makes it available for further post-extraction exploration. Conclusion: This study has provided a method for a computer algorithm to learn a clinical coding standard and then apply that learned standard to unstandardized data, so that the unstandardized data become easily exchangeable, comparable, and searchable, ultimately achieving data interoperability. Even though this study was applied on a limited scale, future work would explore the standardization of patients' long-lived data from multiple sources using the SHARPn open-source tools and data scaling platforms. Information Science. M. Sc. (Computing)
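    The structured-data experiment above combines string-similarity weights (edit distance, TF-IDF, Jaro-Winkler) with Soundex blocking and a threshold separating matches from non-matches. The sketch below illustrates the general blocking-then-scoring pattern using only the Python standard library, with a simplified similarity measure standing in for the study's weighted combination; the term lists and the 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def soundex(word: str) -> str:
    """Simplified Soundex code, used here only to block candidate pairs."""
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    digits = ["".join(d for g, d in codes.items() if c in g) for c in word]
    out, prev = word[0].upper(), digits[0]
    for d in digits[1:]:
        if d and d != prev:
            out += d
        prev = d
    return (out + "000")[:4]

def similarity(a: str, b: str) -> float:
    """Stand-in similarity weight; the study combined edit distance,
    TF-IDF, and Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

loinc_coded = ["Hemoglobin", "Creatinine", "Sodium"]   # coded terms (illustrative)
uncoded = ["haemoglobin", "creatinin", "potassium"]    # uncoded terms (illustrative)

THRESHOLD = 0.8  # illustrative cut-off separating matches from non-matches
for u in uncoded:
    for c in loinc_coded:
        if soundex(u) != soundex(c):   # blocking: skip dissimilar-sounding pairs
            continue
        w = similarity(u, c)
        print(u, c, round(w, 2), "match" if w >= THRESHOLD else "non-match")
```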

    Improving patient record search

    Improving health search is a broad area concerned with the effectiveness of Information Retrieval (IR) systems (also called search engines) while also providing grounds for the creation of reliable test collections. In this research we analyse IR and text processing methods to improve health search, mainly the search of Electronic Patient Records (EPR). We also propose a novel approach to evaluating IR systems that, unlike traditional IR evaluation, does not rely on human relevance judgements. We find that our metadata-based method is more effective than query expansion using external knowledge sources, and that our simulated relevance judgements have a positive correlation with human-made relevance judgements.
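    The abstract reports a positive correlation between simulated and human relevance judgements without naming the measure. One plausible way to quantify such agreement, sketched here with SciPy and hypothetical judgement scores (not the thesis data), is a rank correlation:

```python
from scipy.stats import spearmanr  # assumes SciPy is available

# Hypothetical graded relevance scores for the same query-document pairs:
# one set simulated (e.g. from metadata), one from human assessors.
simulated = [3, 0, 2, 1, 3, 0, 1, 2]
human     = [2, 0, 3, 1, 3, 1, 0, 2]

rho, p_value = spearmanr(simulated, human)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```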

    Record linkage of population-based cohort data from minors with national register data: A scoping review and comparative legal analysis of four European countries

    Funding Information: We would like to acknowledge Evert-Ben van Veen from the MLC Foundation, Dagelijkse Groenmarkt 2, 2513 AL Den Haag, the Netherlands; the country-specific text on the Netherlands was based on his contribution. Publisher Copyright: © 2021 Doetsch JN et al. Background: The GDPR was implemented to build an overarching framework for personal data protection across the EU/EEA. Linkage of data directly collected from cohort participants, potentially serving as a prominent tool for health research, must respect data protection rules and privacy rights. Our objective was to investigate the legal possibilities of linking cohort data of minors with routinely collected education and health data, comparing EU/EEA member states. Methods: A comparative legal analysis and scoping review were conducted of openly accessible laws and regulations, published in EUR-Lex and national law databases, on the GDPR's implementation in Portugal, Finland, Norway, and the Netherlands and on the connected national regulations governing record linkage for health research, implemented up until April 30, 2021. Results: The GDPR does not ensure total uniformity in data protection legislation across member states, offering flexibility for national legislation. Exceptions to process personal data, e.g., for public interest and scientific research, must be laid down in EU/EEA or national law. Differences in national interpretation caused obstacles to cross-national research and record linkage: Portugal requires written consent and ethical approval; Finland allows linkage mostly without consent through the national Social and Health Data Permit Authority; Norway allows it when based on a regional ethics committee's approval and adequate information technology safeguarding confidentiality; the Netherlands mainly bases linkage on the opt-out system and a Data Protection Impact Assessment. Conclusions: Though the GDPR is the most important legal framework, the execution of national legislation matters most when linking cohort data with routinely collected health and education data. As national interpretation varies, legal intervention balancing the individual right to informational self-determination and the public good is urgently needed for health research. More harmonization across the EU/EEA could be helpful but should not be detrimental in those member states that have already opened a leeway for registries and research in the public good without explicit consent. Peer reviewed

    Repeatable and reusable research - Exploring the needs of users for a Data Portal for Disease Phenotyping

    Background: Big data research in the field of health sciences is hindered by a lack of agreement on how to identify and define different conditions and their medications. This means that researchers and health professionals often have different phenotype definitions for the same condition. This lack of agreement makes it hard to compare different study findings and hinders the ability to conduct repeatable and reusable research. Objective: This thesis aims to examine the requirements of various users, such as researchers, clinicians, machine learning experts, and managers, for both new and existing data portals for phenotypes (concept libraries). Methods: Exploratory sequential mixed methods were used in this thesis to look at which concept libraries are available, how they are used, what their characteristics are, where there are gaps, and what needs to be done in the future from the point of view of the people who use them. This thesis consists of three phases: 1) two qualitative studies, including one-to-one interviews with researchers, clinicians, machine learning experts, and senior research managers in health data science, as well as focus group discussions with researchers working with the Secure Anonymised Information Linkage (SAIL) Databank; 2) the creation of an email survey (i.e., the Concept Library Usability Scale); and 3) a quantitative study with researchers, health professionals, and clinicians. Results: Most of the participants thought that the prototype concept library would be a very helpful resource for conducting repeatable research, but they specified that many requirements must be met before its development. Although all the participants stated that they were aware of some existing concept libraries, most of them expressed negative perceptions about them. The participants mentioned several facilitators that would encourage them to: 1) share their work, such as receiving citations from other researchers; and 2) reuse the work of others, such as saving the considerable time and effort they frequently spend on creating new code lists from scratch. They also pointed out several barriers that could inhibit them from: 1) sharing their work, such as concerns about intellectual property (e.g., that if they shared their methods before publication, other researchers would use them as their own); and 2) reusing others' work, such as a lack of confidence in the quality and validity of their code lists. Participants suggested developments they would like to see in order to make research done with routine data more reproducible, such as a drive for more transparency in research methods documentation, including publishing complete phenotype definitions and clear code lists. Conclusions: The findings of this thesis indicate that most participants valued a concept library for phenotypes. However, only half of the participants felt that they would contribute by providing definitions for the concept library, and they reported many barriers to sharing their work on a publicly accessible platform such as the CALIBER research platform. Analysis of the interviews, focus group discussions, and qualitative studies revealed that different users have different requirements, facilitators, barriers, and concerns about concept libraries. This work also investigated whether concept libraries should be developed in Kuwait to facilitate improved data sharing.
However, at the end of this thesis, the recommendation is that this would be unlikely to be cost-effective or highly valued by users, and that investment in open-access research publications may be of more value to the Kuwaiti research and academic community.

    PandeMedia: an annotated corpus of digital media for issue salience

    Master's thesis, Data Science, 2022, Universidade de Lisboa, Faculdade de Ciências. The ubiquitous sharing of information via the Internet has shifted much of society's communication and information-seeking to digital spaces, such as news websites and social networks. As the Web represents a massive hub of information dissemination and discussion, it has also made possible the extraction of great amounts of highly detailed data to answer complex questions on human behaviour and preferences. This shift towards online life was exaggerated during the earlier phases of the COVID-19 pandemic, when many countries were in lockdown and in-person contact was severely limited. Therefore, in addition to the ongoing political, economic, and public health crisis, there were, on the one hand, new opportunities to study human behaviour through digital data, including support for public health measures or trust in science, while, on the other hand, the deluge of new data and the fast-changing nature of the pandemic created new challenges for data science research, particularly the need to build quality pipelines for data extraction, collection, and future analysis. In this thesis, we focus on the important issue of the salience of science and scientists during a health crisis and ask how to build a pipeline to select, store, extract, and analyse longitudinal digital media data that might allow for long-term study of media effects on salience. Therefore, this project has two main components: first, we showcase a data pipeline that uses media and social media data, available online, to build a media corpus of news and tweets with millions of documents, spanning billions of tokens, corresponding to more than two years of coverage and multiple media sources and topics; second, we show how this corpus can be leveraged to study the problem of salience, and use the visibility of science during the earlier phases of the COVID-19 pandemic as a case study, comparing salience in traditional versus social media. Overall, we present both a transparent and scalable pipeline and a specific application of this approach to tackle the question of how science visibility changed during this massive crisis. We use different media types and sources to enable text mining and other analytical purposes, offering a digital data-centric computational methodology to investigate questions in the social sciences. Data today plays a central role in the functioning of human societies. With the development of digital technologies, combined with ubiquitous connectivity to the Internet, in particular to the World Wide Web (WWW), we live in the so-called "information age". This societal paradigm rests on the phenomenon typically referred to as datafication: the now ingrained process, inherent to everyday life, through which our human activity and forms of participation in society are converted into data. This large-scale, real-time production of data serves as fuel for a wide range of applications across the most varied domains, from industry to scientific research to healthcare, among others. We thus witness a growing demand, and even need, for large data collections to feed the different sectors of activity. The Web perhaps represents the largest volume of data widely available to the general public.
It is on websites and in online applications that a large part of the population carries out a set of daily tasks and actions, whether professional or recreational. Our information-consumption habits are predominantly served by these digital spaces, such as social networks or the digital platforms of traditional media. Likewise, our social interactions mediated by digital devices are increasingly frequent. The Web is therefore a reservoir of potential discoveries and of valuable information, which can eventually be extracted by exploring the data it contains. By its very nature, the Web raises great challenges as to how to capture this value present in digital data. Enormous volumes of data can be quickly and easily identified and extracted. However, no value can be added to these data without their first going through a phase of organisation. For knowledge to be extracted from the data obtained, the data must have adequate organisation and quality. The greatest difficulties in digital data collection and management methodologies lie precisely in ensuring this quality. Web data are naturally very heterogeneous, since they result from the convergence of countless sources of information. They are also, for the most part, unstructured, namely in textual formats that need to be interpreted computationally and compartmentalised to ease future analysis. Often there are also missing data, or data of such low quality that they are unusable for the purposes in mind. Beyond these factors intrinsic to the data themselves, the questions surrounding them are also crucial to consider: the ability to detect and locate the intended data, the ability to access these data, and the degree of availability of these data when they are accessible. The ethical, privacy, and copyright questions associated with the data that may be collected must also be taken into account. ... to automate collection processes for sources and types of data as diverse as those available on the Web. The pandemic caused by SARS-CoV-2, the agent of COVID-19, represents a crisis of enormous proportions in the political, economic, and public-health spheres. With the world's population restricted in its behaviours and habits in order to prevent the spread of the virus from worsening, people turned to digital means of communication and of obtaining and disseminating information (and misinformation). The media and social networks thus became relevant points of convergence for a large part of the public's attention, raising important questions about the public perception of scientific experts and about the salience of certain topics of discussion. In a broader context, we can view the pandemic crisis as a challenge in the field of information technologies. As this public-health emergency unfolded, we have been confronted with several of the challenges present in data science: complex data, at the scale of entire populations, produced in real time by multiple sources, with different structures and formats, which become quickly outdated and require rapid analysis, but also robust cleaning and enhancement processes.
All these factors lead us to our main question: in a crisis that evolves as rapidly as the COVID-19 pandemic, how can we build a pipeline that allows us to meet the challenges of data collection and management so as to create a digital media dataset for analysis? To extract the necessary data, we drew on three distinct sources: the open-source platform Media Cloud, the Internet Archive database, and the API of the social network Twitter. We began by defining eighteen distinct topics, made up of keywords used to search for media articles and posts. Some topics are related to the pandemic, while others serve as potential positive and negative controls. The semantic cohesion of each topic was ensured through the use of the lexical database WordNet, which provides word meanings and relations. The metadata initially obtained were processed and used to identify the primary sources of the news data. Through Web scraping, we obtained raw data from media articles from the United States of America available online, from January 2019 to January 2021 (inclusive). These were subsequently transformed through a process of filtering, cleaning, and formatting, accompanied by exploratory data analysis and data visualisation to diagnose the complete process. The social-network data were extracted through a dedicated API, specifying parameters to restrict results to the United States and to the previously defined time window. The duly processed data were then stored in a database designed and built for the purpose. The database was conceived with four tables, containing the news data, the Twitter social-network data, the metadata of the original searches, and metadata on the news sources, and was implemented with the PostgreSQL database management system. To optimise search performance over our dataset, we built indexes for specific fields, namely text fields, which are our main interest. Using the available functionality, vector representations of the news text were built, and from these an index appropriate for searching textual data was constructed, which reduced search time by a factor in the tens of thousands. We further demonstrate preliminary querying of longitudinal data for the purpose of studying the salience of different topics in the media. Different statistical time-series analysis methodologies were applied to answer the questions at hand. Through the use of moving averages, the signals were smoothed for better visualisation. Stationarity tests served as a diagnostic for the transformations to apply to the data in order to guarantee the validity of subsequent analyses. With Granger causality tests, it was possible to establish relations between time series based on predictive power and thus understand the interaction dynamics of different media. Using change-point detection techniques, we were able to defend the idea that there were periods of change in the patterns observed in the media that coincide with the onset of the pandemic crisis.
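    The paragraph above describes building a text-search index over vector representations of the news text in PostgreSQL. One standard way to obtain this kind of speed-up (not necessarily the exact commands used in the thesis) is a GIN index over a tsvector expression, sketched below with psycopg2; the connection string and the table and column names (news, body, id, published_at) are assumptions.

```python
import psycopg2  # assumes psycopg2 is installed and the database exists

# Connection parameters are placeholders, not the thesis configuration.
conn = psycopg2.connect("dbname=pandemedia user=postgres")
cur = conn.cursor()

# Build a GIN index over the full-text (tsvector) representation of article
# bodies, so keyword queries no longer need to scan every row.
cur.execute("""
    CREATE INDEX IF NOT EXISTS news_body_fts_idx
    ON news USING GIN (to_tsvector('english', body));
""")
conn.commit()

# Example full-text query that can use the index.
cur.execute("""
    SELECT id, published_at
    FROM news
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'vaccine science');
""")
print(cur.fetchmany(5))

cur.close()
conn.close()
```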
Thus, powered by a custom, robust, and transparent pipeline, we were able to generate a media corpus containing millions of documents, holding billions of tokens, corresponding to a time span of more than two years and to multiple news sources and topics, thereby enabling text mining and other analytical purposes and offering a digital data-centric computational methodology to investigate this kind of question in the social sciences.
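    The salience analysis described in this abstract relies on stationarity diagnostics and Granger causality tests between media time series. A minimal sketch of those two tests, assuming statsmodels and synthetic series rather than the thesis data, is:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, grangercausalitytests

rng = np.random.default_rng(0)

# Hypothetical daily salience counts for a topic in news vs. social media,
# where the social-media series loosely follows the news series with a lag.
news = rng.poisson(20, 200).astype(float)
social = 0.6 * np.roll(news, 2) + rng.normal(0, 2, 200)

# Stationarity diagnostics (ADF test); difference the series if needed.
for name, series in [("news", news), ("social", social)]:
    stat, pvalue = adfuller(series)[:2]
    print(f"ADF {name}: statistic={stat:.2f}, p={pvalue:.3f}")

# Does the news series help predict the social-media series?
# Column order: [dependent series, candidate causal series].
data = np.column_stack([social, news])
results = grangercausalitytests(data, maxlag=3)
```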