5 research outputs found

    The Hmong Medical Corpus: a biomedical corpus for a minority language

    Biomedical communication is an area that increasingly benefits from natural language processing (NLP) work. Biomedical named entity recognition (NER) in particular provides a foundation for advanced NLP applications, such as automated medical question-answering and translation services. However, while a large body of biomedical documents is available in an array of languages, most work in biomedical NER remains in English, with the remainder in official national or regional languages. Minority languages so far remain an underexplored area. The Hmong language, a minority language with sizable populations in several countries and without official status anywhere, represents an exceptional challenge for effective communication in medical contexts. Taking advantage of the large number of government-produced medical information documents in Hmong, we have developed the first named entity-annotated biomedical corpus for a resource-poor minority language. The Hmong Medical Corpus contains 100,535 tokens with 4,554 named entities (NEs) of three UMLS semantic types: diseases/syndromes, signs/symptoms, and body parts/organs/organ components. Furthermore, a subset of the corpus is annotated for word position and parts of speech, representing the first such gold-standard dataset publicly available for Hmong. The methodology presented provides a readily reproducible approach for the creation of biomedical NE-annotated corpora for other resource-poor languages.

    An Automated Method to Enrich and Expand Consumer Health Vocabularies Using GloVe Word Embeddings

    Clear language makes communication easier between any two parties. However, a layman may have difficulty communicating with a professional because he or she does not understand the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical jargon, which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen medical terms to professional medical terms and vice versa. Many of the presented vocabularies are built manually or semi-automatically, which requires large investments of time and human effort and consequently slows the growth of these vocabularies. In this dissertation, we present an automatic method to enrich existing concepts in a medical ontology with additional laymen terms and also to expand the ontology's coverage to concepts that do not yet have associated laymen terms. Our work has the benefit of being applicable to vocabularies in any domain. Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Embeddings (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies. We improve these vocabularies by incorporating synonyms and hyponyms from the WordNet ontology. By performing iterative feedback using GloVe’s candidate terms, we can boost the number of word occurrences in the co-occurrence matrix, allowing our approach to work with a smaller training corpus. Our novel algorithms and GloVe were evaluated using two laymen datasets from the National Library of Medicine (NLM): the Open-Access and Collaborative Consumer Health Vocabulary (OAC CHV) and the MedlinePlus Healthcare Vocabulary. For our first goal, enriching concepts, the results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Our best algorithm, which enhanced the corpus with synonyms from WordNet, outperformed GloVe with a relative F-score improvement of 25%. For our second goal, expanding the number of concepts with related laymen’s terms, our synonym-enhanced GloVe outperformed GloVe with a relative F-score improvement of 63%. The results of the system were in general promising, and the approach can be applied not only to enrich and expand laymen vocabularies for medicine but also to any ontology for a domain, given an appropriate corpus for the domain. Our approach is applicable to narrow domains that may not have the huge training corpora typically used with word embedding approaches. In essence, by incorporating an external source of linguistic information, WordNet, and expanding the training corpus, we are getting more out of our training corpus. Our system can help build an application that lets patients read their physician's letters with better understanding. Moreover, the output of this system can be used to improve the results of healthcare search engines, entity recognition systems, and many others.
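
The abstract above says that enriching the corpus with synonyms raises the counts in the co-occurrence matrix that GloVe is trained on. The following is a minimal toy sketch of that idea only, not the dissertation's algorithm: the function names, the window size, the tiny corpus, and the `toy_synonyms` dictionary (a hand-written stand-in for WordNet lookups) are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right so that each pair is counted once.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return counts


def enrich_with_synonyms(sentences, synonyms):
    """Return the corpus plus one extra sentence per available synonym substitution."""
    enriched = list(sentences)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for syn in synonyms.get(word, []):
                enriched.append(tokens[:i] + [syn] + tokens[i + 1:])
    return enriched


# Hypothetical toy data; a real setup would query WordNet instead.
corpus = [["my", "belly", "hurts"], ["belly", "pain", "after", "eating"]]
toy_synonyms = {"belly": ["abdomen", "stomach"]}

base = cooccurrence_counts(corpus)
enriched = cooccurrence_counts(enrich_with_synonyms(corpus, toy_synonyms))
```

In this sketch the enriched matrix gains entries for pairs such as `("abdomen", "hurts")` that never appear in the raw corpus, and the counts of existing context pairs grow as well, which is one plausible reading of how a smaller corpus could still yield usable statistics.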

    Dataset Search In Biodiversity Research: Do Metadata In Data Repositories Reflect Scholarly Information Needs?

    The increasing amount of research data provides the opportunity to link and integrate data to create novel hypotheses, to repeat experiments or to compare recent data to data collected at a different time or place. However, recent studies have shown that retrieving relevant data for data reuse is a time-consuming task in daily research practice. In this study, we explore what hampers dataset retrieval in biodiversity research, a field that produces a large amount of heterogeneous data. We analyze the primary source in dataset search, metadata, and determine whether they reflect scholarly search interests. We examine whether metadata standards provide elements corresponding to search interests, we inspect whether selected data repositories use metadata standards representing scholarly interests, and we determine how many fields of the metadata standards used are filled. To determine search interests in biodiversity research, we gathered 169 questions that researchers aimed to answer with the help of retrieved data, identified biological entities and grouped them into 13 categories. Our findings indicate that environments, materials and chemicals, species, biological and chemical processes, locations, data parameters and data types are important search interests in biodiversity research. The comparison with existing metadata standards shows that domain-specific standards cover search interests quite well, whereas general standards do not explicitly contain elements that reflect search interests. We inspect metadata from five large data repositories. Our results confirm that metadata currently poorly reflect search interests in biodiversity research. From these findings, we derive recommendations for researchers and data repositories on how to bridge the gap between search interest and metadata provided.
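
One step named in the abstract above, determining how many fields of a metadata standard are actually filled, can be sketched as a per-field fill rate. This is an illustrative sketch under stated assumptions, not the study's tooling: the `fill_rate` helper, the field names, and the sample records are hypothetical.

```python
def fill_rate(records, fields):
    """Return, for each field, the fraction of records with a non-empty value."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = 0
        for record in records:
            value = record.get(field)
            # Treat missing keys, None, and whitespace-only strings as unfilled.
            if value is not None and str(value).strip() != "":
                filled += 1
        rates[field] = filled / total if total else 0.0
    return rates


# Hypothetical metadata records; real repositories expose far richer schemas.
records = [
    {"title": "A", "location": "", "species": "Quercus robur"},
    {"title": "B", "location": None},
    {"title": "C", "location": "Jena", "species": " "},
]
rates = fill_rate(records, ["title", "location", "species"])
```

Fields with a low rate in such a tally would be the ones where a search interest cannot be matched even though the standard offers an element for it.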