2,388 research outputs found
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves extracting the entire list of chemicals mentioned in a document, together with any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges that assess system performance, particularly the CHEMDNER and CHEMDNER-patents tasks of BioCreative IV and V, respectively. Given the growing interest in constructing automatically annotated chemical knowledge bases that integrate chemical information and biological data, we also present cheminformatics approaches for mapping extracted chemical names to chemical structures and annotating them, together with text mining applications for linking chemistry with biological information. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by the Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia) and FEDER (European Union), and by the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
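As a minimal illustration of the chemical entity recognition step described above, the sketch below implements a dictionary-lookup mention tagger. The tiny lexicon and example sentence are invented for illustration; real systems would combine large dictionaries (e.g. derived from ChEBI or PubChem) with machine-learned taggers.

```python
import re

# Hypothetical mini-lexicon standing in for a real chemical dictionary.
CHEMICAL_LEXICON = {"aspirin", "acetylsalicylic acid", "ibuprofen", "ethanol"}

def tag_chemicals(text):
    """Return sorted (start, end, mention) spans for lexicon hits."""
    spans = []
    # Match longer (multi-word) entries first so "acetylsalicylic acid"
    # is not shadowed by a shorter overlapping hit.
    for entry in sorted(CHEMICAL_LEXICON, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text, re.IGNORECASE):
            overlaps = any(s < m.end() and m.start() < e for s, e, _ in spans)
            if not overlaps:
                spans.append((m.start(), m.end(), m.group()))
    return sorted(spans)

hits = tag_chemicals("Aspirin (acetylsalicylic acid) was dissolved in ethanol.")
```

Pure dictionary lookup misses novel or misspelled names, which is precisely why the CHEMDNER-style machine-learning approaches surveyed in the Review exist.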
Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning
Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled the rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database, which contains 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts, opening avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest when using state-of-the-art neural network architectures. To explain this variation, linguistic measures are proposed to serve as indicators of entity extraction performance and to provide deeper insight into domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work, the annotated datasets and applications are open source and freely available to the public to foster further research in public health.
Doctoral Dissertation, Biomedical Informatics.
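The abstract does not specify which linguistic measures are proposed; as a hedged illustration of the kind of indicator involved, the sketch below computes the lexical diversity of an entity class (distinct surface forms over total mentions) on made-up data. A class with many unique surface forms is typically harder for an extraction model.

```python
# Lexical diversity of an entity class: distinct surface forms divided by
# total mentions. The mention lists are invented for illustration.
def lexical_diversity(mentions):
    if not mentions:
        return 0.0
    normalized = [m.lower() for m in mentions]
    return len(set(normalized)) / len(normalized)

# Location mentions repeat often; ADR mentions tend to vary more.
locations = lexical_diversity(["China", "china", "Brazil", "China"])
reactions = lexical_diversity(["head hurts", "migraine", "pounding headache"])
```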
Identification and analysis of health states in Twitter messages
Social media has become widely used all over the world for its ability to connect people from different countries and create global communities. One of the most prominent social media platforms is Twitter, a platform where users can share text segments with a maximum length of 280 characters. Due to the nature of the platform, it generates very large amounts of text data about its users’ lives. This data can be used to extract health information about a segment of the population for the purpose of public health surveillance. The Social Media Mining for Health (#SMM4H) Shared Task is a challenge that encompasses many Natural Language Processing tasks related to the use of social media data for health research purposes. This dissertation describes my participation in Task 1 of the Shared Task, which was divided into three subtasks. Subtask 1a consisted of classifying tweets regarding the presence of Adverse Drug Events. Subtask 1b was a Named Entity Recognition task that aimed at detecting Adverse Drug Effect spans in tweets. Subtask 1c was a normalization task that sought to match an Adverse Drug Event mention to a Medical Dictionary for Regulatory Activities (MedDRA) preferred term ID. To find the best approach for each subtask, I ran many experiments with different models and techniques to identify those best suited to it. To solve these subtasks, I used transformer-based models as well as other techniques that address the challenges present in each subtask. The best-performing approach for subtask 1a was a BERTweet large model trained with an augmented training set. As for subtask 1b, the best results were obtained with a RoBERTa large model trained on oversampled data.
Regarding subtask 1c, I used a RoBERTa base model trained with data from an additional dataset beyond the one made available by the shared task organizers. The systems used for subtasks 1a and 1b both achieved state-of-the-art performance; however, the approach for the third subtask was not able to achieve favorable results. The system used in subtask 1a achieved an F1 score of 0.698, the one used in subtask 1b achieved a relaxed F1 score of 0.661, and the one used in the final subtask achieved a relaxed F1 score of 0.116.
Master's dissertation, Informatics Engineering.
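The oversampling used for subtask 1b is not detailed in this abstract; a minimal sketch of the general idea, with invented tweet data and a hypothetical `has_entity` predicate, could look like this:

```python
import random

# Duplicate minority-class examples (tweets containing ADE spans) until
# the positive/negative ratio reaches a target. Data and the 1.0 target
# ratio are illustrative, not the dissertation's actual settings.
def oversample(examples, has_entity, target_ratio=1.0, seed=13):
    """Duplicate positive examples until positives/negatives >= target_ratio."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if has_entity(ex)]
    neg = [ex for ex in examples if not has_entity(ex)]
    boosted = list(pos)
    while pos and neg and len(boosted) / len(neg) < target_ratio:
        boosted.append(rng.choice(pos))
    out = boosted + neg
    rng.shuffle(out)
    return out

tweets = [("took drug X, now dizzy", True)] + [("feeling fine today", False)] * 4
balanced = oversample(tweets, has_entity=lambda ex: ex[1])
```

Oversampling only duplicates existing positives; the data augmentation mentioned for subtask 1a instead generates new training examples.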
Incorporating Ontological Information in Biomedical Entity Linking of Phrases in Clinical Text
Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. Translational application of BEL on clinical notes has enormous potential for augmenting discretely captured data in electronic health records, but the existing paradigm for evaluating BEL systems developed in academia is not well aligned with real-world use cases. In this work, we demonstrate a proof of concept for incorporating ontological similarity into the training and evaluation of BEL systems to begin to rectify this misalignment. This thesis has two primary components: 1) a comprehensive literature review and 2) a methodology section proposing novel BEL techniques to contribute to scientific progress in the field. In the literature review component, I survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, and outline the technical components that comprise BEL systems. In the methodology component, I describe my experiments incorporating ontological information into training a BERT encoder for entity linking.
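The thesis's exact similarity measure is not given in this abstract; one common way to give entity-linking predictions partial ontological credit is path-based similarity over the ontology's is-a graph, sketched here on an invented mini ontology:

```python
from collections import deque

# Invented mini ontology (child -> parent is-a edges), for illustration only.
PARENTS = {
    "myocardial_infarction": "heart_disease",
    "heart_failure": "heart_disease",
    "heart_disease": "cardiovascular_disease",
    "stroke": "cardiovascular_disease",
}

def path_similarity(a, b):
    """1 / (1 + shortest undirected path length) between two concepts."""
    edges = {}
    for child, parent in PARENTS.items():
        edges.setdefault(child, set()).add(parent)
        edges.setdefault(parent, set()).add(child)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:  # breadth-first search from a toward b
        node, d = frontier.popleft()
        if node == b:
            return 1.0 / (1 + d)  # exact match -> 1.0, sibling -> 1/3, ...
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return 0.0  # unreachable concepts earn no credit

# Predicting the sibling concept earns partial rather than zero credit:
score = path_similarity("myocardial_infarction", "heart_failure")
```

Under exact-match evaluation both near-miss and nonsense predictions score zero; a graded measure like this distinguishes them, which is the misalignment the thesis targets.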
A study on extracting clinical information from unstructured text for pharmacovigilance
Doctoral dissertation, Graduate School of Convergence Science and Technology, Department of Applied Bioengineering, Seoul National University, February 2023. Advisor: 이형기.
Pharmacovigilance is a scientific activity to detect, evaluate, and understand the occurrence of adverse drug events or other problems related to drug safety. However, concerns have been raised over the quality of drug safety information for pharmacovigilance, and there is a need to secure new data sources for acquiring drug safety information. Meanwhile, the rise of pre-trained language models based on the transformer architecture has accelerated the application of natural language processing (NLP) techniques in diverse domains. In this context, I defined two pharmacovigilance problems as NLP tasks and provided baseline models for them: 1) extracting comprehensive drug safety information from adverse drug event narratives reported through a spontaneous reporting system (SRS), and 2) extracting drug-food interaction information from abstracts of biomedical articles. I developed annotation guidelines and performed manual annotation, demonstrating that strong NLP models can be trained to extract clinical information from unstructured free text by fine-tuning transformer-based language models on a high-quality annotated corpus. Finally, I discuss issues to consider when developing annotation guidelines for extracting clinical information related to pharmacovigilance. The annotated corpora and NLP models in this dissertation can streamline pharmacovigilance activities by enhancing the data quality of reported drug safety information and expanding the available data sources.
Chapter 1
1.1 Contributions of this dissertation
1.2 Overview of this dissertation
1.3 Other works
Chapter 2
2.1 Pharmacovigilance
2.2 Biomedical NLP for pharmacovigilance
2.2.1 Pre-trained language models
2.2.2 Corpora to extract clinical information for pharmacovigilance
Chapter 3
3.1 Motivation
3.2 Proposed Methods
3.2.1 Data source and text corpus
3.2.2 Annotation of ADE narratives
3.2.3 Quality control of annotation
3.2.4 Pretraining KAERS-BERT
3.2.6 Named entity recognition
3.2.7 Entity label classification and sentence extraction
3.2.8 Relation extraction
3.2.9 Model evaluation
3.2.10 Ablation experiment
3.3 Results
3.3.1 Annotated ICSRs
3.3.2 Corpus statistics
3.3.3 Performance of NLP models to extract drug safety information
3.3.4 Ablation experiment
3.4 Discussion
3.5 Conclusion
Chapter 4
4.1 Motivation
4.2 Proposed Methods
4.2.1 Data source
4.2.2 Annotation
4.2.3 Quality control of annotation
4.2.4 Baseline model development
4.3 Results
4.3.1 Corpus statistics
4.3.2 Annotation quality
4.3.3 Performance of baseline models
4.3.4 Qualitative error analysis
4.4 Discussion
4.5 Conclusion
Chapter 5
5.1 Issues around defining a word entity
5.2 Issues around defining a relation between word entities
5.3 Issues around defining entity labels
5.4 Issues around selecting and preprocessing annotated documents
Chapter 6
6.1 Dissertation summary
6.2 Limitations and future works
6.2.1 Development of end-to-end information extraction models from free texts to databases based on existing structured information
6.2.2 Application of in-context learning frameworks in clinical information extraction
Chapter 7
7.1 Annotation Guideline for "Extraction of Comprehensive Drug Safety Information from Adverse Event Narratives Reported through Spontaneous Reporting System"
7.2 Annotation Guideline for "Extraction of Drug-Food Interactions from the Abstracts of Biomedical Articles"
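As a small illustration of the annotation-to-model step these chapters describe, the sketch below converts character-offset span annotations into token-level BIO tags for NER training. The whitespace tokenizer and example narrative are simplifications, not the dissertation's actual pipeline, which would align spans to the transformer's subword tokens.

```python
# Character-offset annotations -> token-level BIO tags.
def to_bio(text, spans):
    """spans: list of (start, end, label) character offsets into text."""
    tags, cursor = [], 0
    for token in text.split():
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # B- on the first token of a span, I- on continuations.
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append(tag)
    return tags

tags = to_bio("severe headache after vaccination", [(0, 15, "ADE")])
```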
Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective
In this work, we present the first corpus for German Adverse Drug Reaction
(ADR) detection in patient-generated content. The data consists of 4,169 binary
annotated documents from a German patient forum, where users talk about health
issues and get advice from medical doctors. As is common in social media data
in this domain, the class labels of the corpus are very imbalanced. Together
with a high topic imbalance, this makes it a very challenging dataset: often
the same symptom can have several causes and is not always related to
medication intake. We aim to encourage further multi-lingual efforts in the domain of ADR
detection and provide preliminary experiments for binary classification using
different methods of zero- and few-shot learning based on a multi-lingual
model. When fine-tuning XLM-RoBERTa first on English patient forum data and
then on the new German data, we achieve an F1-score of 37.52 for the positive
class. We make the dataset and models publicly available for the community.
Accepted at LREC 202
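The number reported above is the F1-score of the positive (ADR) class only, which is the informative measure on such an imbalanced corpus. A minimal sketch of that computation, on toy labels:

```python
# F1 restricted to the positive class; gold/pred labels are toy data.
def positive_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = positive_f1([1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0])
```

On a corpus where most documents are negative, overall accuracy would look deceptively high; positive-class F1 exposes how well the rare ADR class is actually detected.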
GNTeam at 2018 n2c2: Feature-augmented BiLSTM-CRF for drug-related entity recognition in hospital discharge summaries
Monitoring the administration of drugs and adverse drug reactions are key
parts of pharmacovigilance. In this paper, we explore the extraction of drug
mentions and drug-related information (reason for taking a drug, route,
frequency, dosage, strength, form, duration, and adverse events) from hospital
discharge summaries through deep learning that relies on various
representations for clinical named entity recognition. This work was officially
part of the 2018 n2c2 shared task, and we use the data supplied as part of the
task. We developed two deep learning architectures based on recurrent neural
networks and pre-trained language models. We also explore the effect of
augmenting word representations with semantic features for clinical named
entity recognition. Our feature-augmented BiLSTM-CRF model achieved an
F1-score of 92.67% and ranked 4th in the entity extraction sub-task among
systems submitted to the n2c2 challenge. The recurrent neural networks that use the
pre-trained domain-specific word embeddings and a CRF layer for label
optimization perform drug, adverse event and related entities extraction with
micro-averaged F1-score of over 91%. The augmentation of word vectors with
semantic features extracted using available clinical NLP toolkits can further
improve the performance. Word embeddings that are pre-trained on a large
unannotated corpus of relevant documents and further fine-tuned to the task
perform rather well. However, the augmentation of word embeddings with semantic
features can help improve the performance (primarily by boosting precision) of
drug-related named entity recognition from electronic health records.
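The feature augmentation described here amounts to concatenating each token's pre-trained embedding with a small vector of semantic features before the BiLSTM-CRF layer. A toy sketch, with invented embeddings and an illustrative feature inventory standing in for the tags a clinical NLP toolkit would produce:

```python
# Illustrative feature inventory; a real system would use semantic types
# emitted by a clinical NLP toolkit rather than this toy list.
SEMANTIC_TYPES = ["DRUG", "DISORDER", "NONE"]

def augment(embedding, semantic_type):
    """Concatenate a word embedding with a one-hot semantic-feature vector."""
    one_hot = [1.0 if t == semantic_type else 0.0 for t in SEMANTIC_TYPES]
    return embedding + one_hot  # list concatenation == vector concatenation

# A 3-dim toy embedding becomes a 6-dim augmented input vector.
vec = augment([0.2, -0.1, 0.7], "DRUG")
```

The recurrent layers then see both distributional and symbolic evidence per token, which is what the reported precision boost is attributed to.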