Large-scale event extraction from literature with multi-level gene normalization
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for the application of these data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/).
Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
Exploring Biomolecular Literature with EVEX: Connecting Genes through Events, Homology, and Indirect Associations
Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also to an exponential increase in the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
Combining supervised and unsupervised named entity recognition to detect psychosocial risk factors in occupational health checks
Introduction: In occupational health checks, information about psychosocial risk factors, which influence work ability, is documented in free text. Early detection of psychosocial risk factors helps occupational health care choose the right, targeted interventions to maintain work capacity. The aim of this study was to evaluate whether the recognition of these psychosocial risk factors in electronic occupational health check records can be automated with natural language processing (NLP). Materials and methods: We compared supervised and unsupervised named entity recognition (NER) for detecting psychosocial risk factors in health checks' documentation. The records were written by occupational health nurses. Results: Both methods found over 60% of the psychosocial risk factors in the records. However, the combination of BERT-NER (supervised NER) and QExp (query expansion/paraphrasing) seems to be more suitable. With both methods, most of the correctly identified risk factors were found in the work environment and equipment category. Conclusion: This study showed that it was possible to detect risk factors automatically from free-text documentation of health checks. It is possible to develop a text mining tool to automate the detection of psychosocial risk factors at an early stage.
Wide-scope biomedical named entity recognition and normalization with CRFs, fuzzy matching and character level modeling
We present a system for automatically identifying a multitude of
biomedical entities from the literature. This work is based on our
previous efforts in the BioCreative VI: Interactive Bio-ID Assignment
shared task in which our system demonstrated state-of-the-art
performance with the highest achieved results in named entity
recognition. In this paper we describe the original conditional random
field-based system used in the shared task as well as experiments
conducted since, including better hyperparameter tuning and character
level modeling, which led to further performance improvements. For
normalizing the mentions into unique identifiers we use fuzzy character n-gram
matching. The normalization approach has also been improved with a
better abbreviation resolution method and stricter guideline compliance
resulting in vastly improved results for various entity types. All tools
and models used for both named entity recognition and normalization are
publicly available under an open license.
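The fuzzy character n-gram matching mentioned in this abstract can be illustrated with a minimal sketch. It assumes trigram counts compared by cosine similarity against a small in-memory lexicon; the function names, the `LEXICON` entries, their identifiers, and the threshold are all hypothetical and are not taken from the described system.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return a Counter of character n-grams of the lowercased, '#'-padded text."""
    pad = "#" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def normalize(mention, lexicon, threshold=0.5):
    """Map a mention to the identifier of the most similar lexicon name.

    Returns (identifier, score); identifier is None when the best score
    falls below the (assumed) threshold.
    """
    query = char_ngrams(mention)
    best_id, best_score = None, 0.0
    for name, identifier in lexicon.items():
        score = cosine(query, char_ngrams(name))
        if score > best_score:
            best_id, best_score = identifier, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Hypothetical lexicon with made-up identifiers, for illustration only.
LEXICON = {"tumor protein p53": "ID:0001", "interleukin 6": "ID:0002"}

match_id, match_score = normalize("Interleukin-6", LEXICON)
```

Because similarity is computed over overlapping character sequences, spelling variants such as a hyphen in place of a space still score highly against the lexicon name, which is the property that makes this family of methods tolerant of surface variation.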
Application of the EVEX resource to event extraction and network construction : shared task entry and result analysis
BACKGROUND : Modern methods for mining biomolecular interactions from literature typically make predictions
based solely on the immediate textual context, in effect a single sentence. No prior work has been published on
extending this context to the information automatically gathered from the whole biomedical literature. Thus, our
motivation for this study is to explore whether mutually supporting evidence, aggregated across several
documents can be utilized to improve the performance of the state-of-the-art event extraction systems.
In this paper, we describe our participation in the latest BioNLP Shared Task using the large-scale text mining
resource EVEX. We participated in the Genia Event Extraction (GE) and Gene Regulation Network (GRN) tasks with
two separate systems. In the GE task, we implemented a re-ranking approach to improve the precision of an
existing event extraction system, incorporating features from the EVEX resource. In the GRN task, our system relied
solely on the EVEX resource and utilized a rule-based conversion algorithm between the EVEX and GRN formats.
RESULTS : In the GE task, our re-ranking approach led to a modest performance increase and resulted in the first
rank of the official Shared Task results with 50.97% F-score. Additionally, in this paper we explore and evaluate the
usage of distributed vector representations for this challenge.
In the GRN task, we ranked fifth in the official results with a strict/relaxed SER score of 0.92/0.81 respectively. To try
and improve upon these results, we have implemented a novel machine learning based conversion system and
benchmarked its performance against the original rule-based system.
CONCLUSIONS : For the GRN task, we were able to produce a gene regulatory network from the EVEX data,
warranting the use of such generic large-scale text mining data in network biology settings. A detailed
performance and error analysis provides more insight into the relatively low recall rates.
In the GE task we demonstrate that both the re-ranking approach and the word vectors can provide slight
performance improvement. A manual evaluation of the re-ranking results pinpoints some of the challenges faced
in applying large-scale text mining knowledge to event extraction.
Computational resources were provided by CSC IT Center for Science Ltd.,
Espoo, Finland. The work of KH and FG was supported by the Academy of
Finland, and of SVL by the Research Foundation Flanders (FWO). YVdP and
SVL acknowledge the support from Ghent University (Multidisciplinary
Research Partnership Bioinformatics: from nucleotides to networks).
http://www.biomedcentral.com/bmcbioinformatics
Neural Network and Random Forest Models in Protein Function Prediction
Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system that automatically assigns Gene Ontology (GO) terms to a given input protein sequence. The ensemble combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with a NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models together with the data pre-processing and feature generation tools are publicly available as open-source software at https://github.com/TurkuNLP/CAFA3.
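One simple way to combine per-term predictions from two classifiers is a weighted average of their scores. The sketch below illustrates that idea only; the abstract does not state how the RF and NN outputs are actually combined, so the averaging rule, the weight, the threshold, and the example scores are all assumptions.

```python
def combine_predictions(rf_scores, nn_scores, rf_weight=0.5, threshold=0.5):
    """Weighted average of per-GO-term scores from two classifiers.

    A term missing from one classifier's output is treated as score 0.0.
    Only terms whose combined score reaches the threshold are kept.
    """
    combined = {}
    for term in set(rf_scores) | set(nn_scores):
        score = (rf_weight * rf_scores.get(term, 0.0)
                 + (1.0 - rf_weight) * nn_scores.get(term, 0.0))
        if score >= threshold:
            combined[term] = score
    return combined


# Made-up scores for a single protein, for illustration only.
rf = {"GO:0005524": 0.9, "GO:0003677": 0.4}
nn = {"GO:0005524": 0.7, "GO:0006355": 0.8}

ensemble = combine_predictions(rf, nn)
```

With equal weights, a term predicted confidently by only one model is damped by the other model's silence, so the ensemble favours terms on which both classifiers agree.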
Assisting nurses in care documentation: from automated sentence classification to coherent document structures with subject headings
Background:
Up to 35% of nurses' working time is spent on care documentation.
We describe the evaluation of a system aimed at assisting nurses in
documenting patient care and potentially reducing the documentation
workload. Our goal is to enable nurses to write or dictate nursing notes
in a narrative manner without having to manually structure their text
under subject headings. In the current care classification standard used
in the targeted hospital, there are more than 500 subject headings to
choose from, making it challenging and time consuming for nurses to use.
Methods:
The task of the presented system is to automatically group
sentences into paragraphs and assign subject headings. For
classification the system relies on a neural network-based text
classification model. The nursing notes are initially classified on
sentence level. Subsequently coherent paragraphs are constructed from
related sentences.
Results:
Based on a manual evaluation conducted by a group of three domain
experts, we find that in about 69% of the paragraphs formed by the
system the topics of the sentences are coherent and the assigned
paragraph headings correctly describe the topics. We also show that the
use of a paragraph merging step reduces the number of paragraphs
produced by 23% without affecting the performance of the system.
Conclusions:
The study shows that the presented system produces a coherent and
logical structure for freely written nursing narratives and has the
potential to reduce the time and effort nurses are currently spending on
documenting care in hospitals.
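The two-stage idea in the Methods, sentence-level classification followed by paragraph construction and a merging step, can be sketched as follows. This is a simplified illustration under stated assumptions: the headings are taken as already predicted, paragraphs are formed from consecutive sentences sharing a heading, and merging joins all paragraphs with the same heading. The function names and example notes are hypothetical, and the actual system may group and merge differently.

```python
from itertools import groupby


def group_sentences(labeled_sentences):
    """Group consecutive sentences that share a predicted heading.

    labeled_sentences: list of (sentence, heading) pairs in document order.
    Returns a list of (heading, [sentences]) paragraphs.
    """
    return [
        (heading, [sentence for sentence, _ in group])
        for heading, group in groupby(labeled_sentences, key=lambda pair: pair[1])
    ]


def merge_paragraphs(paragraphs):
    """Merge paragraphs with the same heading, keeping first-seen heading order."""
    merged = {}
    for heading, sentences in paragraphs:
        merged.setdefault(heading, []).extend(sentences)
    return list(merged.items())


# Hypothetical sentences with assumed classifier output, for illustration only.
notes = [
    ("Patient slept well.", "Sleep"),
    ("Woke once at night.", "Sleep"),
    ("Pain 3/10 in the morning.", "Pain"),
    ("Slept again after lunch.", "Sleep"),
]

paragraphs = group_sentences(notes)
merged = merge_paragraphs(paragraphs)
```

In this toy input the merging step reduces three paragraphs to two, which mirrors the kind of reduction in paragraph count that the Results describe.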