Towards Automatic Extraction of Social Networks of Organizations in PubMed Abstracts
Social Network Analysis (SNA) of organizations can attract great interest
from government agencies and scientists for its ability to boost translational
research and accelerate the process of converting research to care. For SNA of
a particular disease area, we need to identify the key research groups in that
area by mining the affiliation information from PubMed. This not only involves
recognizing the organization names in the affiliation string, but also
resolving ambiguities to identify the article with a unique organization. We
present here a process of normalization that involves clustering based on local
sequence alignment metrics and local learning based on finding connected
components. We demonstrate the application of the method by analyzing
organizations involved in angiogenesis treatment, and show the utility of the
results for researchers in the pharmaceutical and biotechnology industries and
for national funding agencies.
Comment: This paper has been withdrawn; First International Workshop on Graph
Techniques for Biomedical Networks in Conjunction with IEEE International
Conference on Bioinformatics and Biomedicine, Washington D.C., USA, Nov. 1-4,
2009; http://www.public.asu.edu/~sjonnal3/home/papers/IEEE%20BIBM%202009.pd
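As a rough sketch of the two-step normalization described above, the snippet below clusters organization strings by pairwise similarity and then takes connected components of the resulting graph. Note the assumptions: `difflib.SequenceMatcher` is a simplified stand-in for the paper's local sequence alignment metric, and the example names and 0.85 threshold are illustrative, not values from the paper:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Simplified stand-in for a local sequence alignment similarity score.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_organizations(names, threshold=0.85):
    # Add an edge between two names when similarity clears the threshold;
    # each connected component becomes one candidate organization cluster.
    adj = {i: set() for i in range(len(names))}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    seen, clusters = set(), []
    for i in range(len(names)):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:
            k = stack.pop()
            if k in seen:
                continue
            seen.add(k)
            comp.append(names[k])
            stack.extend(adj[k] - seen)
        clusters.append(sorted(comp))
    return clusters

names = [
    "Mayo Clinic, Rochester, MN",   # two spellings of one organization
    "Mayo Clinic Rochester MN",
    "Stanford University School of Medicine",
]
clusters = cluster_organizations(names)
```

Here the two "Mayo Clinic" variants end up in one component while the unrelated name forms its own singleton cluster.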
NEMO: Extraction and normalization of organization names from PubMed affiliations
Background: We are witnessing an exponential increase in biomedical research citations in PubMed. However, translating biomedical discoveries into practical treatments is estimated to take around 17 years, according to the 2000 Yearbook of Medical Informatics, and much information is lost during this transition. Pharmaceutical companies spend huge sums to identify opinion leaders and centers of excellence. Conventional methods such as literature search, survey, observation, self-identification, expert opinion, and sociometry not only need much human effort, but are also non-comprehensive. Such huge delays and costs can be reduced by "connecting those who produce the knowledge with those who apply it". A humble step in this direction is large-scale discovery of persons and organizations involved in specific areas of research. This can be achieved by automatically extracting and disambiguating author names and affiliation strings retrieved through Medical Subject Heading (MeSH) terms and other keywords associated with articles in PubMed. In this study, we propose NEMO (Normalization Engine for Matching Organizations), a system for extracting organization names from the affiliation strings provided in PubMed abstracts, building a thesaurus (list of synonyms) of organization names, and subsequently normalizing them to a canonical organization name using the thesaurus.
Results: We used a parsing process that involves multi-layered rule matching with multiple dictionaries. The normalization process involves clustering based on weighted local sequence alignment metrics to address synonymy at the word level, and local learning based on finding connected components to resolve synonymous organization names. The graphical user interface and Java client library
of NEMO are available at http://lnxnemo.sourceforge.net.
Conclusion: NEMO associates each biomedical paper and its authors with a unique organization name and the geopolitical location of that organization. This system provides more accurate information about organizations than the raw affiliation strings provided in PubMed abstracts. It can be used for: (a) bimodal social network analysis that evaluates the research relationships between individual researchers and their institutions; (b) improving author name disambiguation; (c) augmenting the National Library of Medicine's (NLM) Medical Articles Record System (MARS) by correcting OCR errors on affiliation strings printed in small fonts; and (d) improving PubMed citation indexing strategies (authority control) based on normalized organization names and countries.
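As a minimal illustration of how dictionary-based rule matching can pull an organization and a country out of an affiliation string: the keyword lists and the `parse_affiliation` helper below are hypothetical toys, not NEMO's actual multi-layered rules or curated dictionaries.

```python
# Toy dictionaries; a real system would use much larger curated lists.
ORG_KEYWORDS = {"university", "hospital", "institute", "clinic", "college"}
COUNTRIES = {"usa", "uk", "germany", "japan", "india"}

def parse_affiliation(affiliation):
    # Split on commas, then apply simple keyword rules to pick out the
    # organization segment and the geopolitical location (country).
    parts = [p.strip() for p in affiliation.split(",")]
    org = next((p for p in parts
                if any(k in p.lower() for k in ORG_KEYWORDS)), parts[0])
    country = next((p.rstrip(".") for p in reversed(parts)
                    if p.lower().rstrip(". ") in COUNTRIES), None)
    return {"organization": org, "country": country}

aff = "Department of Biomedical Informatics, Columbia University, New York, USA."
parsed = parse_affiliation(aff)
```

On this example the keyword rule picks "Columbia University" as the organization and "USA" as the country, which is the kind of record-level output the abstract describes.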
Massive-scale Decoding for Text Generation using Lattices
Conditional neural text generation models generate high-quality outputs, but
often concentrate around a mode when what we really want is a diverse set of
options. We present a search algorithm to construct lattices encoding a massive
number of generation options. First, we restructure decoding as a best-first
search, which explores the space differently than beam search and improves
efficiency by avoiding pruning paths. Second, we revisit the idea of hypothesis
recombination: we can identify pairs of similar generation candidates during
search and merge them as an approximation. On both summarization and machine
translation, we show that our algorithm encodes thousands of diverse options
that remain grammatical and high-quality into one lattice. This algorithm
provides a foundation for building downstream generation applications on top of
massive-scale diverse outputs.
Comment: NAACL 2022; see https://github.com/jiacheng-xu/lattice-generation for code
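A toy sketch of the two ideas, best-first search plus hypothesis recombination, follows. The scoring function is a made-up stand-in for a neural model, and recombination here merges hypotheses that share the same (length, last token) state, a much cruder approximation than the paper's similarity-based merging:

```python
import heapq
from itertools import count

def next_tokens(prefix):
    # Hypothetical fixed scores; a real system would query a neural LM.
    return [("good", -0.4), ("fine", -0.9), ("<eos>", -0.6)]

def lattice_decode(max_steps=3):
    # Best-first search over prefixes ordered by total log-probability.
    # A hypothesis whose state key already exists is recombined (recorded
    # as an alternate path) instead of being re-expanded, so the search
    # builds a lattice rather than a k-best list.
    tiebreak = count()
    heap = [(0.0, next(tiebreak), ())]   # (-logprob, id, token prefix)
    nodes = {}                           # state key -> merged prefixes
    finished = []
    while heap:
        neg, _, prefix = heapq.heappop(heap)
        if prefix and prefix[-1] == "<eos>":
            finished.append((prefix, -neg))
            continue
        if len(prefix) >= max_steps:
            continue
        for tok, logprob in next_tokens(prefix):
            new = prefix + (tok,)
            key = (len(new), new[-1])    # unigram recombination state
            if key in nodes:
                nodes[key].append(new)   # merge: record an alternate path
            else:
                nodes[key] = [new]
                heapq.heappush(heap, (neg - logprob, next(tiebreak), new))
    return nodes, finished

nodes, finished = lattice_decode()
```

Because of the heap ordering, `finished` comes out best-first, and states such as `nodes[(2, "good")]` hold several merged prefixes, which is where the search saves the work that beam search would spend re-expanding duplicates.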
Enhancing clinical concept extraction with distributional semantics
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology to unlock the knowledge within and support more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of the effectiveness of treatment. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, as the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns covered, which impacts the performance of supervised machine learning applications trained with these data. This paper proposes an approach to minimize this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text.

The approach uses a sequential discriminative classifier (Conditional Random Fields) to extract the mentions of medical problems, treatments and tests from clinical narratives. It takes advantage of all Medline abstracts indexed as being of the publication type "clinical trials" to estimate the relatedness between words in the i2b2/VA training and testing corpora. In addition to traditional features such as dictionary matching, pattern matching and part-of-speech tags, we also used as a feature words that appear in similar contexts to the word in question (that is, words that have a similar vector representation measured with the commonly used cosine metric, where vector representations are derived using methods of distributional semantics). To the best of our knowledge, this is the first effort exploring the use of distributional semantics, the semantics derived empirically from unannotated text often using vector space models, for a sequence classification task such as concept extraction.
Therefore, we first experimented with different sliding window models and found the model with parameters that led to the best performance in a preliminary sequence labeling task.

The evaluation of this approach, performed against the i2b2/VA concept extraction corpus, showed that incorporating features based on the distribution of words across a large unannotated corpus significantly aids concept extraction. Compared to a supervised-only approach as a baseline, the micro-averaged F-score for exact match increased from 80.3% to 82.3% and the micro-averaged F-score for inexact match increased from 89.7% to 91.3%. These improvements are highly significant according to the bootstrap resampling method and also considering the performance of other systems. Thus, distributional semantic features significantly improve the performance of concept extraction from clinical narratives by taking advantage of word distribution information obtained from unannotated data.
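A minimal sketch of deriving such a distributional feature from unannotated text: build window-based co-occurrence vectors, then use cosine similarity to find the distributionally closest word, which could then be fed to a CRF as an extra feature. The four-sentence corpus and the window size are illustrative, not taken from the i2b2/VA or Medline data:

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    # Count context words within a +/- window over unannotated sentences.
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[w][sent[j]] += 1
    return vecs

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    num = sum(u[k] * v[k] for k in set(u) & set(v))
    den = math.sqrt(sum(c * c for c in u.values())) \
        * math.sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def most_similar(word, vecs):
    # The distributionally closest other word; usable as a CRF feature.
    return max((cosine(vecs[word], vecs[o]), o) for o in vecs if o != word)[1]

sentences = [
    ["patient", "denies", "fever"],
    ["patient", "denies", "chills"],
    ["doctor", "notes", "fever"],
    ["doctor", "notes", "chills"],
]
vecs = cooccurrence_vectors(sentences)
```

In this toy corpus, "fever" and "chills" occur in identical contexts, so `most_similar("fever", vecs)` returns "chills": exactly the kind of signal that lets a classifier generalize from annotated mentions of one symptom to unseen mentions of a related one.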
Formative evaluation of a patient-specific clinical knowledge summarization tool
To iteratively design a prototype of a computerized clinical knowledge summarization (CKS) tool aimed at helping clinicians find answers to their clinical questions, and to conduct a formative assessment of the usability, usefulness, efficiency, and impact of the CKS prototype on physicians' perceived decision quality compared with standard search of UpToDate and PubMed.