135 research outputs found
Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning
abstract: Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database containing 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts to open avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators for entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work the annotated datasets and applications are open source and freely available to the public to foster further research in public health.Dissertation/ThesisDoctoral Dissertation Biomedical Informatics 201
Neural Network Approaches to Medical Toponym Recognition
Toponym identification, or place name recognition, within epidemiology articles is a crucial task for phylogeographers, as it allows them to analyze the development, spread, and migration of viruses. Although, public databases, such as GenBank (Benson et al., November 2012), contain the geographical information, this information is typically restricted to country and state levels. In order to identify more fine-grained localization information, epidemiologists need to read relevant scientific articles and manually extract place name mentions.
In this thesis, we investigate the use of various neural network architectures and language representations to automatically segment and label toponyms within biomedical texts. We demonstrate how our language model based toponym recognizer relying on transformer architecture can achieve state-of-the-art performance. This model uses pre-trained BERT as the backbone and fine tunes on two domains of datasets (general articles and medical articles) in order to measure the generalizability of the approach and cross-domain transfer learning.
Using BERT as the backbone of the model, resulted in a large highly parameterized model (340M parameters). In order to obtain a light model architecture we experimented with parameter pruning techniques, specifically we experimented with Lottery Ticket Hypothesis (Frankle and Carbin, May 2019) (LTH), however as indicated by Frankle and Carbin (May 2019), their pruning technique does not scale well to highly parametrized models and loses stability. We proposed a novel technique to augment LTH in order to increase the scalability and stability of this technique to highly parametrized models such as BERT and tested our technique on toponym identification task.
The evaluation of the model was performed using a collection of 105 epidemiology articles from PubMed Central (Weissenbacher et al., June 2015). Our proposed model significantly improves the state-of-the-art model by achieving an F-measure of 90.85% compared to 89.13%
A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching
Recognizing toponyms and resolving them to their real-world referents is required to provide advanced semantic access to textual data. This process is often hindered by the high degree of variation in toponyms. Candidate selection is the task of identifying the potential entities that can be referred to by a previously recognized toponym. While it has traditionally received little attention, candidate selection has a significant impact on downstream tasks (i.e. entity resolution), especially in noisy or non-standard text. In this paper, we introduce a deep learning method for candidate selection through toponym matching, using state-of-the-art neural network architectures. We perform an intrinsic toponym matching evaluation based on several datasets, which cover various challenging scenarios (cross-lingual and regional variations, as well as OCR errors) and assess its performance in the context of geographical candidate selection in English and Spanish. </p
Automated Georeferencing of Antarctic Species
Many text documents in the biological domain contain references to the toponym of specific phenomena (e.g. species sightings) in natural language form "In Garwood Valley summer activity was 0.2% for Umbilicaria aprina and 1.7% for Caloplaca sp. ..."
While methods have been developed to extract place names from documents, and attention has been given to the interpretation of spatial prepositions, the ability to connect toponym mentions in text with the phenomena to which they refer (in this case species) has been given limited attention, but would be of considerable benefit for the task of mapping specific phenomena mentioned in text documents.
As part of work to create a pipeline to automate georeferencing of species within legacy documents, this paper proposes a method to: (1) recognise species and toponyms within text and (2) match each species mention to the relevant toponym mention. Our methods find significant promise in a bespoke rules- and dictionary-based approach to recognise species within text (F1 scores up to 0.87 including partial matches) but less success, as yet, recognising toponyms using multiple gazetteers combined with an off the shelf natural language processing tool (F1 up to 0.62).
Most importantly, we offer a contribution to the relatively nascent area of matching toponym references to the object they locate (in our case species), including cases in which the toponym and species are in different sentences. We use tree-based models to achieve precision as high as 0.88 or an F1 score up to 0.68 depending on the downsampling rate. Initial results out perform previous research on detecting entity relationships that may cross sentence boundaries within biomedical text, and differ from previous work in specifically addressing species mapping
Extracting locations from sport and exercise-related social media messages using a neural network-based bilingual toponym recognition model
Sport and exercise contribute to health and well-being in cities. While previous research has mainly focused on activities at specific locations such as sport facilities, "informal sport" that occur at arbitrary locations across the city have been largely neglected. Such activities are more challenging to observe, but this challenge may be addressed using data collected from social media platforms, because social media users regularly generate content related to sports and exercise at given locations. This allows studying all sport, including those "informal sport" which are at arbitrary locations, to better understand sports and exercise-related activities in cities. However, user-generated geographical information available on social media platforms is becoming scarcer and coarser. This places increased emphasis on extracting location information from free-form text content on social media, which is complicated by multilingualism and informal language. To support this effort, this article presents an end-to-end deep learning-based bilingual toponym recognition model for extracting location information from social media content related to sports and exercise. We show that our approach outperforms five state-of-the-art deep learning and machine learning models. We further demonstrate how our model can be deployed in a geoparsing framework to support city planners in promoting healthy and active lifestyles.Peer reviewe
Recommended from our members
Where are you talking about? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring
The Natural Language Processing task we focus on in this thesis is Geoparsing. Geoparsing is the process of extraction and grounding of toponyms (place names). Consider this sentence: "The victims of the Spanish earthquake off the coast of Malaga were of American and Mexican origin." Four toponyms will be extracted (called Geotagging) and grounded to their geographic coordinates (called Toponym Resolution). However, our research goes further than any previous work by showing how to distinguish the literal place(s) of the event (Spain, Malaga) from other linguistic types/uses such as nationalities (Mexican, American), improving downstream task accuracy. We consolidate and extend the Standard Evaluation Framework, discuss key research problems, then present concrete solutions in order to advance each stage of geoparsing. For geotagging, as well as training a SOTA neural Location-NER tagger, we simplify Metonymy Resolution with a novel minimalist feature extraction combined with an LSTM-based classifier, matching SOTA results. For toponym resolution, we deploy the latest deep learning methods to achieve SOTA performance by augmenting neural models with hitherto unused geographic features called Map Vectors. With each research project, we provide high-quality datasets and system prototypes, further building resources in this field. We then show how these geoparsing advances coupled with our proposed Intra-Document Analysis can be used to associate news articles with locations in order to monitor the spread of public health threats. To this end, we evaluate our research contributions with production data from a real-time downstream application to improve geolocation of news events for disease monitoring. The data was made available to us by the Joint Research Centre (JRC), which operates one such system called MediSys that processes incoming news articles in order to monitor threats to public health and make these available to a variety of governmental, business and non-profit organisations. We also discuss steps towards an end-to-end, automated news monitoring system and make actionable recommendations for future work. In summary, the thesis aims are twofold: (1) Generate original geoparsing research aimed at advancing each stage of the pipeline by addressing pertinent challenges with concrete solutions and actionable proposals. (2) Demonstrate how this research can be applied to news event monitoring to increase the efficacy of existing biosurveillance systems, e.g. European Commission’s MediSys.I was generously funded by DREAM CDT, which was funded by NERC of UKRI
- …