Cell line name recognition in support of the identification of synthetic lethality in cancer from text
Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.
Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
BioRED: A Comprehensive Biomedical Relation Extraction Dataset
Automated relation extraction (RE) from biomedical literature is critical for
many downstream text mining applications in both research and real-world
settings. However, most existing benchmarking datasets for biomedical RE only
focus on relations of a single type (e.g., protein-protein interactions) at the
sentence level, greatly limiting the development of RE systems in biomedicine.
In this work, we first review commonly used named entity recognition (NER) and
RE datasets. Then we present BioRED, a first-of-its-kind biomedical RE corpus
with multiple entity types (e.g., gene/protein, disease, chemical) and relation
pairs (e.g., gene-disease; chemical-chemical), on a set of 600 PubMed articles.
Further, we label each relation as describing either a novel finding or
previously known background knowledge, enabling automated algorithms to
differentiate between novel and background information. We assess the utility
of BioRED by benchmarking several existing state-of-the-art methods, including
BERT-based models, on the NER and RE tasks. Our results show that while
existing approaches can reach high performance on the NER task (F-score of
89.3%), there is much room for improvement for the RE task, especially when
extracting novel relations (F-score of 47.7%). Our experiments also demonstrate
that such a comprehensive dataset can successfully facilitate the development
of more accurate, efficient, and robust RE systems for biomedicine.
BERT WEAVER: Using WEight AVERaging to enable lifelong learning for transformer-based models in biomedical semantic search engines
Recent developments in transfer learning have boosted the advancements in
natural language processing tasks. The performance is, however, dependent on
high-quality, manually annotated training data. Especially in the biomedical
domain, it has been shown that one training corpus is not enough to learn
generic models that are able to efficiently predict on new data. Therefore, to
be used in real-world applications, state-of-the-art models need the ability of
lifelong learning to improve performance as soon as new data are available,
without the need to retrain the whole model from scratch. We
present WEAVER, a simple, yet efficient post-processing method that infuses old
knowledge into the new model, thereby reducing catastrophic forgetting. We show
that applying WEAVER in a sequential manner results in similar word embedding
distributions as doing a combined training on all data at once, while being
computationally more efficient. Because there is no need for data sharing, the
presented method is also easily applicable to federated learning settings and
can for example be beneficial for the mining of electronic health records from
different clinics.
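The weight-averaging idea named in the WEAVER title can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `average_weights`, the plain-dictionary representation of model parameters, and the choice to weight each model by the size of its training set are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(old: Weights, new: Weights,
                    old_size: int, new_size: int) -> Weights:
    """Blend two parameter sets of the same architecture.

    Each parameter is a weighted mean of the old and new value, with
    weights proportional to the (assumed) training set sizes.
    """
    total = old_size + new_size
    if total <= 0:
        raise ValueError("training set sizes must sum to a positive number")
    if old.keys() != new.keys():
        raise ValueError("models must share the same parameter names")

    merged: Weights = {}
    for name in new:
        if len(old[name]) != len(new[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (old_size * o + new_size * n) / total
            for o, n in zip(old[name], new[name])
        ]
    return merged


if __name__ == "__main__":
    old_model = {"layer.weight": [0.0, 2.0]}
    new_model = {"layer.weight": [4.0, 6.0]}
    print(average_weights(old_model, new_model, old_size=1, new_size=3))
```

Because only parameters are exchanged in such a scheme, the training data of each site never has to leave it, which is consistent with the federated setting the abstract mentions.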
Enhancing biomedical word embeddings by retrofitting to verb clusters
Verbs play a fundamental role in many biomedical tasks and applications such as relation and event extraction. We hypothesize that performance on many downstream tasks can be improved by aligning the input pretrained embeddings according to semantic verb classes. In this work, we show that by using semantic clusters for verbs, a large lexicon of verb classes derived from biomedical literature, we are able to improve the performance of common pretrained embeddings in downstream tasks by retrofitting them to verb classes. We present a simple and computationally efficient approach using a widely available “off-the-shelf” retrofitting algorithm to align pretrained embeddings according to semantic verb clusters. We achieve state-of-the-art results on text classification and relation extraction tasks.
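Retrofitting to verb clusters can be sketched as follows. The abstract does not name the algorithm, so this example assumes the common iterative update in which each vector is pulled toward the mean of its cluster neighbours while staying anchored to its original value. The function name `retrofit`, the `alpha` parameter, and the toy verbs are hypothetical.

```python
from typing import Dict, List, Sequence, Set

Embeddings = Dict[str, List[float]]


def retrofit(embeddings: Embeddings,
             clusters: Sequence[Sequence[str]],
             iterations: int = 10,
             alpha: float = 1.0) -> Embeddings:
    """Return new vectors nudged toward other members of the same cluster.

    alpha weights the original vector; each of a word's neighbours gets
    weight 1/degree, so the neighbour weights sum to 1 (an assumption).
    """
    # Words in the same cluster are treated as neighbours of each other.
    neighbours: Dict[str, Set[str]] = {w: set() for w in embeddings}
    for cluster in clusters:
        members = [w for w in cluster if w in embeddings]
        for w in members:
            neighbours[w].update(m for m in members if m != w)

    # Work on copies so the input embeddings stay untouched.
    updated: Embeddings = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, nbrs in neighbours.items():
            if not nbrs:
                continue  # words outside every cluster keep their vector
            beta = 1.0 / len(nbrs)
            updated[word] = [
                (alpha * embeddings[word][d]
                 + beta * sum(updated[n][d] for n in nbrs)) / (alpha + 1.0)
                for d in range(len(embeddings[word]))
            ]
    return updated


if __name__ == "__main__":
    vectors = {"bind": [0.0, 0.0], "attach": [2.0, 2.0], "cell": [5.0, 5.0]}
    print(retrofit(vectors, clusters=[["bind", "attach"]]))
```

In this toy run the two verbs sharing a cluster move closer together, while the word that belongs to no cluster is left unchanged.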