Cell line name recognition in support of the identification of synthetic lethality in cancer from text
Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.
Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.
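For readers unfamiliar with how an F-score for mention recognition is typically computed, the following is a minimal sketch. It assumes exact-match scoring over (document, start, end) spans; the abstract does not state the matching criterion used for Gellus or CLL, and the function name and toy spans are hypothetical.

```python
def span_f_score(gold, predicted):
    """Exact-match precision, recall and F1 over (doc_id, start, end) spans.

    Assumption: a predicted mention counts as correct only if its document
    and both offsets match a gold mention exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): three gold mentions, three predictions, two overlap.
gold_spans = {("doc1", 0, 4), ("doc1", 10, 15), ("doc2", 3, 8)}
predicted_spans = {("doc1", 0, 4), ("doc2", 3, 8), ("doc2", 20, 24)}

precision, recall, f1 = span_f_score(gold_spans, predicted_spans)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Relaxed criteria (for example, counting overlapping spans as matches) would give higher scores on the same predictions, so figures are only comparable when the criterion is the same.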
Spanish named entity recognition in the biomedical domain
Named Entity Recognition in the clinical domain, and in languages other than English, is hampered by incomplete dictionaries, informal text, polysemous terms, disagreement over entity boundaries, and the scarcity of corpora and other resources. We present a Named Entity Recognition method for poorly resourced languages. The method was tested on Spanish radiology reports and compared with a conditional random fields system. Peer reviewed. Postprint (author's final draft).
From POS tagging to dependency parsing for biomedical event extraction
Background: Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance.
Results: We perform an empirical study comparing state-of-the-art traditional feature-based and neural network-based models for two core natural language processing tasks, part-of-speech (POS) tagging and dependency parsing, on two benchmark biomedical corpora, GENIA and CRAFT. To the best of our knowledge, there is no recent work making such comparisons in the biomedical context; specifically, no detailed analysis of neural models on this data is available. Experimental results show that, in general, the neural models outperform the feature-based models on both corpora. We also perform a task-oriented evaluation to investigate the influence of these models on a downstream biomedical event extraction application, and show that better intrinsic parsing performance does not always imply better extrinsic event extraction performance.
Conclusion: We have presented a detailed empirical study comparing traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context, and also investigated the influence of parser selection on a downstream biomedical event extraction task.
Availability of data and material: We make the retrained models available at https://github.com/datquocnguyen/BioPosDep
Comment: Accepted for publication in BMC Bioinformatics
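As an illustration of what "intrinsic parsing performance" commonly means, the sketch below computes unlabeled and labeled attachment scores (UAS and LAS) for one sentence. The abstract does not name its metrics, so treating UAS/LAS as the intrinsic measure is an assumption here, and the function name and toy parse are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for parallel lists of (head_index, label) per token.

    UAS counts tokens whose predicted head is correct; LAS additionally
    requires the dependency label to match.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    return correct_heads / total, correct_arcs / total


# Toy three-token sentence (hypothetical); head index 0 denotes the root.
gold_parse = [(2, "nsubj"), (0, "root"), (2, "obj")]
predicted_parse = [(2, "nsubj"), (0, "root"), (2, "nmod")]

uas, las = attachment_scores(gold_parse, predicted_parse)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```

An extrinsic evaluation would instead plug each parser's output into the event extraction system and compare the resulting event-level scores, which is why the two rankings can disagree.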