11 research outputs found
Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data
The abstract of a scientific paper distills the contents of the paper into a
short paragraph. In the biomedical literature, it is customary to structure an
abstract into discourse categories like BACKGROUND, OBJECTIVE, METHOD, RESULT,
and CONCLUSION, but this segmentation is uncommon in other fields like computer
science. Explicit categories could be helpful for more granular, that is,
discourse-level search and recommendation. The sparsity of labeled data makes
it challenging to construct supervised machine learning solutions for automatic
discourse-level segmentation of abstracts in non-bio domains. In this paper, we
address this problem using transfer learning. In particular, we define three
discourse categories (BACKGROUND, TECHNIQUE, and OBSERVATION) for an abstract because
these three categories are the most common. We train a deep neural network on
structured abstracts from PubMed, then fine-tune it on a small hand-labeled
corpus of computer science papers. We observe an accuracy of 75% on the test
corpus. We perform an ablation study to highlight the roles of the different
parts of the model. Our method appears to be a promising solution to the
automatic segmentation of abstracts where labeled data is sparse.
Comment: to appear in the proceedings of JCDL'202
Improving Scientific Literature Classification: A Parameter-Efficient Transformer-Based Approach
Transformer-based models have been utilized in natural language processing (NLP) for a wide variety of tasks such as summarization, translation, and conversational agents. Because these models can capture long-term dependencies within the input, they have significantly greater representational capabilities than Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Nevertheless, these models require significant computational resources in terms of high memory usage and extensive training time. In this paper, we propose a novel document categorization model with improved parameter efficiency that encodes text using a single, lightweight, multi-headed attention encoder block. The model also uses a hybrid word and position embedding to represent input tokens. The proposed model is evaluated on the Scientific Literature Classification (SLC) task and is compared with state-of-the-art models that have previously been applied to the task. Ten datasets of varying sizes and class distributions have been employed in the experiments. The proposed model shows significant performance improvements, with a high level of efficiency in terms of parameter and computational resource requirements as compared to other transformer-based models, and outperforms previously used methods.
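The two ingredients this abstract names, a hybrid word-plus-position embedding and a single multi-headed attention encoder block, can be sketched in NumPy. All dimensions and weights below are illustrative (the paper does not specify them in the abstract), and training is omitted; the sketch only shows the forward pass that produces a document vector for classification.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, max_len, d_model, n_heads = 100, 16, 32, 4
d_head = d_model // n_heads

# Hybrid embedding: a word embedding added to a position embedding.
word_emb = rng.normal(size=(vocab_size, d_model)) * 0.02
pos_emb = rng.normal(size=(max_len, d_model)) * 0.02

def embed(token_ids):
    n = len(token_ids)
    return word_emb[token_ids] + pos_emb[:n]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Weights of the single multi-headed self-attention encoder block.
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

def encoder_block(x):
    n = x.shape[0]
    # Split queries, keys, and values into n_heads parallel heads.
    q = (x @ Wq).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d_model)
    return x + out @ Wo  # residual connection

tokens = rng.integers(0, vocab_size, size=10)
h = encoder_block(embed(tokens))
# Mean-pool over tokens to get one fixed-size vector per document,
# which a classification head would then consume.
doc_vec = h.mean(axis=0)
print(doc_vec.shape)
```

The parameter-efficiency claim follows directly from the structure: one attention block means one set of Wq/Wk/Wv/Wo matrices instead of the dozen-plus stacked blocks of a standard transformer encoder.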
Domain-independent Extraction of Scientific Concepts from Research Articles
We examine the novel task of domain-independent scientific concept extraction
from abstracts of scholarly articles and present two contributions. First, we
suggest a set of generic scientific concepts that have been identified in a
systematic annotation process. This set of concepts is utilised to annotate a
corpus of scientific abstracts from 10 domains of Science, Technology and
Medicine at the phrasal level in a joint effort with domain experts. The
resulting dataset is used in a set of benchmark experiments to (a) provide
baseline performance for this task, (b) examine the transferability of concepts
between domains. Second, we present two deep learning systems as baselines. In
particular, we propose active learning to deal with different domains in our
task. The experimental results show that (1) a substantial agreement is
achievable by non-experts after consultation with domain experts, (2) the
baseline system achieves a fairly high F1 score, (3) active learning enables us
to nearly halve the amount of required training data.
Comment: Accepted for publication in the 42nd European Conference on IR Research, ECIR 202
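The pool-based active-learning loop implied by result (3) above can be sketched as follows. This is a generic illustration with synthetic data and least-confidence sampling; the abstract does not specify the paper's query strategy, so that choice is an assumption here, as is the use of LogisticRegression in place of the paper's deep learning systems.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the annotation task.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Small labeled seed set (both classes represented) plus an "unlabeled" pool.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(500) if i not in labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(5):
    clf.fit(X[labeled], y[labeled])
    # Least-confidence sampling: query the pool examples whose top-class
    # probability is lowest, i.e. where the model is most uncertain.
    probs = clf.predict_proba(X[pool]).max(axis=1)
    query = [pool[i] for i in np.argsort(probs)[:10]]
    labeled.extend(query)               # simulate asking the annotator
    pool = [i for i in pool if i not in query]

print(len(labeled))  # 10 seed + 5 rounds of 10 queries = 60 labeled
```

Labeling the most uncertain examples first is what lets such a loop reach a target accuracy with far fewer annotations than random sampling, which is the mechanism behind the paper's "nearly halve the training data" result.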
Automated Knowledge Extraction from IS Research Articles Combining Sentence Classification and Ontological Annotation
Manually analyzing large collections of research articles is a time- and resource-intensive activity, making it difficult to stay on top of the latest research findings. Limitations of automated solutions lie in limited domain knowledge and in not being able to attribute extracted key terms to a focal article, related work, or background information. We aim to address this challenge by (1) developing a framework for classifying sentences in scientific publications, (2) performing several experiments comparing state-of-the-art sentence transformer algorithms with a novel few-shot learning technique, and (3) automatically analyzing a corpus of articles and evaluating automated knowledge extraction capabilities. We tested our approach for combining sentence classification with ontological annotations on a manually created dataset of 1,000 sentences from Information Systems (IS) articles. The results indicate a high degree of accuracy, underlining the potential of novel approaches for analyzing scientific publications.
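The abstract does not specify the few-shot technique, so the sketch below shows one common pattern for few-shot sentence classification: embed a handful of labeled sentences per class and assign new sentences to the nearest class centroid. TF-IDF stands in here for sentence-transformer embeddings, and the class labels and sentences are illustrative, not from the paper's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid

# A handful of labeled sentences per class ("few-shot"). The classes
# mimic the paper's goal of separating a focal article's own findings
# from related work and background; labels here are hypothetical.
train_sents = [
    "Prior studies have examined user adoption of IS.",   # background
    "Earlier work proposed a similar framework.",         # background
    "We find that trust strongly predicts adoption.",     # finding
    "Our results show a significant moderation effect.",  # finding
]
train_labels = ["background", "background", "finding", "finding"]

# TF-IDF vectors stand in for sentence-transformer embeddings.
vec = TfidfVectorizer().fit(train_sents)
clf = NearestCentroid().fit(vec.transform(train_sents), train_labels)

pred = clf.predict(vec.transform(
    ["We observe that perceived ease of use matters."]))
print(pred[0])
```

In a full pipeline, each classified sentence would then be passed to the ontological-annotation step so that extracted key terms carry both a discourse role and a concept label.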
sBERT: Parameter-Efficient Transformer-Based Deep Learning Model for Scientific Literature Classification
This paper introduces a parameter-efficient transformer-based model designed for scientific literature classification. By optimizing the transformer architecture, the proposed model significantly reduces memory usage, training time, inference time, and the carbon footprint associated with large language models. The proposed approach is evaluated against various deep learning models and demonstrates superior performance in classifying scientific literature. Comprehensive experiments conducted on datasets from Web of Science, ArXiv, Nature, Springer, and Wiley reveal that the proposed model's multi-headed attention mechanism and enhanced embeddings contribute to its high accuracy and efficiency, making it a robust solution for text classification tasks.
BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale
Capturing the semantics of related biological concepts, such as genes and
mutations, is of significant importance to many research tasks in computational
biology such as protein-protein interaction detection, gene-drug association
prediction, and biomedical literature-based discovery. Here, we propose to
leverage state-of-the-art text mining tools and machine learning models to
learn the semantics via vector representations (also known as embeddings) of over
400,000 biological concepts mentioned in the entire PubMed abstracts. Our
learned embeddings, namely BioConceptVec, can capture related concepts based on
their surrounding contextual information in the literature, which is beyond
exact term match or co-occurrence-based methods. BioConceptVec has been
thoroughly evaluated in multiple bioinformatics tasks consisting of over 25
million instances from nine different biological datasets. The evaluation
results demonstrate that BioConceptVec has better performance than existing
methods in all tasks. Finally, BioConceptVec is made freely available to the
research community and general public via
https://github.com/ncbi-nlp/BioConceptVec.
Comment: 33 pages, 6 figures, 7 tables, accepted by PLOS Computational Biology
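The core use of such concept embeddings, retrieving related concepts by vector similarity rather than exact term match or co-occurrence, can be sketched with a toy embedding table. The concept IDs and three-dimensional vectors below are made up for illustration; the real BioConceptVec files at the GitHub URL above cover over 400,000 concepts.

```python
import numpy as np

# Toy stand-in for BioConceptVec: a mapping from biomedical concept IDs
# to dense vectors. IDs and values here are illustrative only.
embeddings = {
    "gene_BRCA1": np.array([0.90, 0.10, 0.00]),
    "gene_BRCA2": np.array([0.85, 0.15, 0.05]),
    "drug_cisplatin": np.array([0.10, 0.90, 0.20]),
    "mutation_V600E": np.array([0.00, 0.20, 0.95]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query, k=2):
    """Rank all other concepts by cosine similarity to `query`."""
    q = embeddings[query]
    scored = [(c, cosine(q, v)) for c, v in embeddings.items() if c != query]
    return sorted(scored, key=lambda cv: cv[1], reverse=True)[:k]

print(most_similar("gene_BRCA1"))
```

Because similarity is computed in the embedding space, two concepts that rarely co-occur in the same sentence can still rank as related if they appear in similar contexts across PubMed, which is exactly the advantage the abstract claims over exact-match and co-occurrence methods.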