A Review On Automatic Text Summarization Approaches
It has been more than 50 years since the initial investigation on automatic text summarization was started. Various techniques have been successfully used to extract the important contents from a text document to represent the document summary. In this study, we review some of the studies that have been conducted in this still-developing research area. It covers the basics of text summarization, the types of summarization, the methods that
have been used and some areas in which text summarization has been applied. Furthermore, this paper also reviews the significant efforts which have been put in studies concerning sentence extraction, domain specific summarization and multi document summarization and provides the theoretical explanation and the fundamental concepts related to it. In addition, the advantages and limitations concerning the approaches commonly used for text summarization are also highlighted in this study.
Automated extraction and semantic analysis of mutation impacts from the biomedical literature
BACKGROUND: Mutations as sources of evolution have long been the focus of attention in the biomedical literature. Accessing the mutational information and their impacts on protein properties facilitates research in various domains, such as enzymology and pharmacology. However, manually curating the rich and fast growing repository of biomedical literature is expensive and time-consuming. As a solution, text mining approaches have increasingly been deployed in the biomedical domain. While the detection of single-point mutations is well covered by existing systems, challenges still exist in grounding impacts to their respective mutations and recognizing the affected protein properties, in particular kinetic and stability properties together with physical quantities.
RESULTS: We present an ontology model for mutation impacts, together with a comprehensive text mining system for extracting and analysing mutation impact information from full-text articles. Organisms, as sources of proteins, are extracted to help disambiguation of genes and proteins. Our system then detects mutation series to correctly ground detected impacts using novel heuristics. It also extracts the affected protein properties, in particular kinetic and stability properties, as well as the magnitude of the effects and validates these relations against the domain ontology. The output of our system can be provided in various formats, in particular by populating an OWL-DL ontology, which can then be queried to provide structured information. The performance of the system is evaluated on our manually annotated corpora. In the impact detection task, our system achieves a precision of 70.4%-71.1%, a recall of 71.3%-71.5%, and grounds the detected impacts with an accuracy of 76.5%-77%. The developed system, including resources, evaluation data and end-user and developer documentation is freely available under an open source license at http://www.semanticsoftware.info/open-mutation-miner.
CONCLUSION: We present Open Mutation Miner (OMM), the first comprehensive, fully open-source approach to automatically extract impacts and related relevant information from the biomedical literature. We assessed the performance of our work on manually annotated corpora and the results show the reliability of our approach. The representation of the extracted information into a structured format facilitates knowledge management and aids in database curation and correction. Furthermore, access to the analysis results is provided through multiple interfaces, including web services for automated data integration and desktop-based solutions for end user interactions
Automated Extraction of Protein Mutation Impacts from the Biomedical Literature
Mutations as sources of evolution have long been the focus of attention in the
biomedical literature. Accessing the mutational information and their impacts
on protein properties facilitates research in various domains, such as
enzymology and pharmacology. However, manually reading through the rich and fast growing repository
of biomedical literature is expensive and time-consuming. A number of manually curated
databases, such as BRENDA (http://www.brenda-enzymes.org), try to index and provide this
information; yet the provided data seems to be incomplete. Thus, there is a
growing need for automated approaches to extract this information.
In this work, we present a system to automatically extract and summarize impact
information from protein mutations.
Our system's extraction module is split into subtasks: organism analysis,
mutation detection, protein property extraction and impact
analysis. Organisms, as sources of proteins, are required to be extracted to
help disambiguation of genes and proteins. Thus, our system extracts and
grounds organisms to NCBI. We detect mutation series to correctly ground our detected
impacts. Our system also extracts the affected protein properties as well as the magnitude of the
effects.
The output of our system is populated to an OWL-DL ontology, which can then be queried to provide structured information. The performance
of the system is evaluated on both external and internal corpora and
databases. The results show the reliability of the approaches. Our organism
extraction system achieves a precision and recall of 95%
and 94% and a grounding accuracy of 97.5% on the OT corpus. On the manually
annotated corpus of Linnaeus-100, the results show a precision and recall of
99% and 97% and grounding with an accuracy of 97.4%.
In the impact detection task, our system achieves a precision and recall of
70.4%-71.8% and 71.2%-71.3% on manually annotated documents. Our system grounds the detected
impacts with an accuracy of 70.1%-71.7% on the manually annotated documents
and a precision and recall of 57%-57.5% and 82.5%-84.2% against the BRENDA data.
A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks
Recently, Large Language Models (LLMs) have demonstrated impressive capability
to solve a wide range of tasks. However, despite their success across various
tasks, no prior work has investigated their capability in the biomedical domain
yet. To this end, this paper aims to evaluate the performance of LLMs on
benchmark biomedical tasks. For this purpose, we conduct a comprehensive
evaluation of 4 popular LLMs in 6 diverse biomedical tasks across 26 datasets.
To the best of our knowledge, this is the first work that conducts an extensive
evaluation and comparison of various LLMs in the biomedical domain.
Interestingly, we find based on our evaluation that in biomedical datasets that
have smaller training sets, zero-shot LLMs even outperform the current
state-of-the-art fine-tuned biomedical models. This suggests that pretraining
on large text corpora makes LLMs quite specialized even in the biomedical
domain. We also find that no single LLM outperforms the other LLMs in all
tasks; the performance of different LLMs varies depending on the task.
While their performance is still quite poor in comparison to the biomedical
models that were fine-tuned on large training sets, our findings demonstrate
that LLMs have the potential to be a valuable tool for various biomedical tasks
that lack large annotated data.
Comment: Extended version of the following BioNLP paper:
https://aclanthology.org/2023.bionlp-1.30/ (arXiv:2306.04504). arXiv admin
note: substantial text overlap with arXiv:2306.0450
Text mining for biology - the way forward: opinions from leading scientists
This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress
Automatic assignment of biomedical categories: toward a generic approach
Motivation: We report on the development of a generic text categorization system designed to automatically assign biomedical categories to any input text. Unlike usual automatic text categorization systems, which rely on data-intensive models extracted from large sets of training data, our categorizer is largely data-independent. Methods: In order to evaluate the robustness of our approach we test the system on two different biomedical terminologies: the Medical Subject Headings (MeSH) and the Gene Ontology (GO). Our lightweight categorizer, based on two ranking modules, combines a pattern matcher and a vector space retrieval engine, and uses both stems and linguistically-motivated indexing units. Results and Conclusion: Results show the effectiveness of phrase indexing for both GO and MeSH categorization, but we observe the categorization power of the tool depends on the controlled vocabulary: precision at high ranks ranges from above 90% for MeSH to <20% for GO, establishing a new baseline for categorizers based on retrieval methods.
Specialized Named Entity Recognition For Breast Cancer Subtyping
The amount of data and analysis being published and archived in the biomedical research community is more than can feasibly be sifted through manually, which limits the information an individual or small group can synthesize and integrate into their own research. This presents an opportunity for using automated methods, including Natural Language Processing (NLP), to extract important information from text on various topics. Named Entity Recognition (NER) is one way to automate knowledge extraction from raw text. NER is defined as the task of identifying named entities in text using labels such as people, dates, locations, diseases, and proteins. There are several NLP tools that are designed for entity recognition, but they rely on large established corpora for training data. Biomedical research has the potential to guide diagnostic and therapeutic decisions, yet the overwhelming density of publications acts as a barrier to getting these results into a clinical setting. An exceptional example of this is the field of breast cancer biology, where over 2 million people are diagnosed worldwide every year and billions of dollars are spent on research. Breast cancer biology literature and research relies on a highly specific domain with unique language and vocabulary, and therefore requires specialized NLP tools which can generate biologically meaningful results. This thesis presents a novel annotation tool that is optimized for quickly creating training data for spaCy pipelines, and explores the viability of said data for analyzing papers with automated processing. Custom pipelines trained on these annotations are shown to be able to recognize custom entities at levels comparable to large corpus based recognition.
Gene Expression Profiling of Corpus luteum Reveals Important Insights about Early Pregnancy in Domestic Sheep
The majority of pregnancy loss in ruminants occurs during the preimplantation stage,
which is thus the most critical period determining reproductive success. Here, we performed a
comparative transcriptome study by sequencing total mRNA from corpus luteum (CL) collected
during the preimplantation stage of pregnancy in Finnsheep, Texel and F1 crosses. A total of 21,287
genes were expressed in our data. Highly expressed autosomal genes in the CL were associated
with biological processes such as progesterone formation (STAR, CYP11A1, and HSD3B1) and
embryo implantation (e.g., TIMP1, TIMP2 and TCTP). Among the list of differentially expressed
genes, sialic acid-binding immunoglobulin (Ig)-like lectins (SIGLEC3, SIGLEC14, SIGLEC8),
ribosomal proteins (RPL17, RPL34, RPS3A, MRPS33) and chemokines (CCL5, CCL24, CXCL13,
CXCL9) were upregulated in Finnsheep, while four multidrug resistance-associated proteins
(MRPs) were upregulated in Texel ewes. A total of 17 known genes and two uncharacterized noncoding
RNAs (ncRNAs) were differentially expressed in breed-wise comparisons owing to the
flushing diet effect. The significantly upregulated TXNL1 gene indicated potential for embryonic
diapause in Finnsheep and F1. Moreover, we report, for the first time in any species, several genes
that are active in the CL during early pregnancy (including TXNL1, SIGLEC14, SIGLEC8, MRP4, and
CA5A).
Genomic data integration and user-defined sample-set extraction for population variant analysis
Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics