11 research outputs found

    Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data

    Full text link
    The abstract of a scientific paper distills the contents of the paper into a short paragraph. In the biomedical literature, it is customary to structure an abstract into discourse categories like BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION, but this segmentation is uncommon in other fields like computer science. Explicit categories could be helpful for more granular, that is, discourse-level search and recommendation. The sparsity of labeled data makes it challenging to construct supervised machine learning solutions for automatic discourse-level segmentation of abstracts in non-bio domains. In this paper, we address this problem using transfer learning. In particular, we define three discourse categories BACKGROUND, TECHNIQUE, OBSERVATION-for an abstract because these three categories are the most common. We train a deep neural network on structured abstracts from PubMed, then fine-tune it on a small hand-labeled corpus of computer science papers. We observe an accuracy of 75% on the test corpus. We perform an ablation study to highlight the roles of the different parts of the model. Our method appears to be a promising solution to the automatic segmentation of abstracts, where the labeled data is sparse.Comment: to appear in the proceedings of JCDL'202

    Improving Scientific Literature Classification: A Parameter-Efficient Transformer-Based Approach

    Get PDF
    Transformer-based models have been utilized in natural language processing (NLP) for a wide variety of tasks like summarization, translation, and conversational agents. These models can capture long-term dependencies within the input, so they have significantly more representational capabilities than Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). Nevertheless, these models require significant computational resources in terms of high memory usage, and extensive training time. In this paper, we propose a novel document categorization model, with improved parameter efficiency that encodes text using a single, lightweight, multiheaded attention encoder block. The model also uses a hybrid word and position embedding to represent input tokens. The proposed model is evaluated for the Scientific Literature Classification task (SLC) and is compared with state-of-the-art models that have previously been applied to the task. Ten datasets of varying sizes and class distributions have been employed in the experiments. The proposed model shows significant performance improvements, with a high level of efficiency in terms of parameter and computation resource requirements as compared to other transformer-based models, and outperforms previously used methods

    Domain-independent Extraction of Scientific Concepts from Research Articles

    Get PDF
    We examine the novel task of domain-independent scientific concept extraction from abstracts of scholarly articles and present two contributions. First, we suggest a set of generic scientific concepts that have been identified in a systematic annotation process. This set of concepts is utilised to annotate a corpus of scientific abstracts from 10 domains of Science, Technology and Medicine at the phrasal level in a joint effort with domain experts. The resulting dataset is used in a set of benchmark experiments to (a) provide baseline performance for this task, (b) examine the transferability of concepts between domains. Second, we present two deep learning systems as baselines. In particular, we propose active learning to deal with different domains in our task. The experimental results show that (1) a substantial agreement is achievable by non-experts after consultation with domain experts, (2) the baseline system achieves a fairly high F1 score, (3) active learning enables us to nearly halve the amount of required training data.Comment: Accepted for publishing in 42nd European Conference on IR Research, ECIR 202

    Automated Knowledge Extraction from IS Research Articles Combining Sentence Classification and Ontological Annotation

    Get PDF
    Manually analyzing large collections of research articles is a time- and resource-intensive activity, making it difficult to stay on top of the latest research findings. Limitations of automated solutions lie in limited domain knowledge and not being able to attribute extracted key terms to a focal article, related work, or background information. We aim to address this challenge by (1) developing a framework for classifying sentences in scientific publications, (2) performing several experiments comparing state-of-the-art sentence transformer algorithms with a novel few-shot learning technique and (3) automatically analyzing a corpus of articles and evaluating automated knowledge extraction capabilities. We tested our approach for combining sentence classification with ontological annotations on a manually created dataset of 1,000 sentences from Information Systems (IS) articles. The results indicate a high degree of accuracy underlining the potential for novel approaches in analyzing scientific publication

    sBERT:Parameter-Efficient Transformer-Based Deep Learning Model for Scientific Literature Classification

    Get PDF
    This paper introduces a parameter-efficient transformer-based model designed for scientific literature classification. By optimizing the transformer architecture, the proposed model significantly reduces memory usage, training time, inference time, and the carbon footprint associated with large language models. The proposed approach is evaluated against various deep learning models and demonstrates superior performance in classifying scientific literature. Comprehensive experiments conducted on datasets from Web of Science, ArXiv, Nature, Springer, and Wiley reveal that the proposed model’s multi-headed attention mechanism and enhanced embeddings contribute to its high accuracy and efficiency, making it a robust solution for text classification tasks

    BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale

    Full text link
    Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (aka. embeddings) of over 400,000 biological concepts mentioned in the entire PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which is beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec has better performance than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public via https://github.com/ncbi-nlp/BioConceptVec.Comment: 33 pages, 6 figures, 7 tables, accepted by PLOS Computational Biolog
    corecore