988 research outputs found

    A hybrid approach for text summarization using semantic latent Dirichlet allocation and sentence concept mapping with transformer

    Get PDF
    Automatic text summarization generates a summary that contains sentences reflecting the essential and relevant information of the original documents. Extractive summarization requires semantic understanding, while abstractive summarization requires a better intermediate text representation. This paper proposes a hybrid approach for generating text summaries that combine extractive and abstractive methods. To improve the semantic understanding of the model, we propose two novel extractive methods: semantic latent Dirichlet allocation (semantic LDA) and sentence concept mapping. We then generate an intermediate summary by applying our proposed sentence ranking algorithm over the sentence concept mapping. This intermediate summary is input to a transformer-based abstractive model fine-tuned with a multi-head attention mechanism. Our experimental results demonstrate that the proposed hybrid model generates coherent summaries using the intermediate extractive summary covering semantics. As we increase the concepts and number of words in the summary the rouge scores are improved for precision and F1 scores in our proposed model

    A Graph-Based Approach for the Summarization of Scientific Articles

    Get PDF
    Automatic text summarization is one of the eminent applications in the field of Natural Language Processing. Text summarization is the process of generating a gist from text documents. The task is to produce a summary which contains important, diverse and coherent information, i.e., a summary should be self-contained. The approaches for text summarization are conventionally extractive. The extractive approaches select a subset of sentences from an input document for a summary. In this thesis, we introduce a novel graph-based extractive summarization approach. With the progressive advancement of research in the various fields of science, the summarization of scientific articles has become an essential requirement for researchers. This is our prime motivation in selecting scientific articles as our dataset. This newly formed dataset contains scientific articles from the PLOS Medicine journal, which is a high impact journal in the field of biomedicine. The summarization of scientific articles is a single-document summarization task. It is a complex task due to various reasons, one of it being, the important information in the scientific article is scattered all over it and another reason being, scientific articles contain numerous redundant information. In our approach, we deal with the three important factors of summarization: importance, non-redundancy and coherence. To deal with these factors, we use graphs as they solve data sparsity problems and are computationally less complex. We employ bipartite graphical representation for the summarization task, exclusively. We represent input documents through a bipartite graph that consists of sentence nodes and entity nodes. This bipartite graph representation contains entity transition information which is beneficial for selecting the relevant sentences for a summary. We use a graph-based ranking algorithm to rank the sentences in a document. The ranks are considered as relevance scores of the sentences which are further used in our approach. Scientific articles contain reasonable amount of redundant information, for example, Introduction and Methodology sections contain similar information regarding the motivation and approach. In our approach, we ensure that the summary contains sentences which are non-redundant. Though the summary should contain important and non-redundant information of the input document, its sentences should be connected to one another such that it becomes coherent, understandable and simple to read. If we do not ensure that a summary is coherent, its sentences may not be properly connected. This leads to an obscure summary. Until now, only few summarization approaches take care of coherence. In our approach, we take care of coherence in two different ways: by using the graph measure and by using the structural information. We employ outdegree as the graph measure and coherence patterns for the structural information, in our approach. We use integer programming as an optimization technique, to select the best subset of sentences for a summary. The sentences are selected on the basis of relevance, diversity and coherence measure. The computation of these measures is tightly integrated and taken care of simultaneously. We use human judgements to evaluate coherence of summaries. We compare ROUGE scores and human judgements of different systems on the PLOS Medicine dataset. Our approach performs considerably better than other systems on this dataset. Also, we apply our approach on the standard DUC 2002 dataset to compare the results with the recent state-of-the-art systems. The results show that our graph-based approach outperforms other systems on DUC 2002. In conclusion, our approach is robust, i.e., it works on both scientific and news articles. Our approach has the further advantage of being semi-supervised

    Topic Similarity Networks: Visual Analytics for Large Document Sets

    Full text link
    We investigate ways in which to improve the interpretability of LDA topic models by better analyzing and visualizing their outputs. We focus on examining what we refer to as topic similarity networks: graphs in which nodes represent latent topics in text collections and links represent similarity among topics. We describe efficient and effective approaches to both building and labeling such networks. Visualizations of topic models based on these networks are shown to be a powerful means of exploring, characterizing, and summarizing large collections of unstructured text documents. They help to "tease out" non-obvious connections among different sets of documents and provide insights into how topics form larger themes. We demonstrate the efficacy and practicality of these approaches through two case studies: 1) NSF grants for basic research spanning a 14 year period and 2) the entire English portion of Wikipedia.Comment: 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014

    Validation of scientific topic models using graph analysis and corpus metadata

    Get PDF
    Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology an innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which are focused on the quality of topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, the aim of our work is to state a new methodology for the selection of hyperparameters which is specifically oriented to optimize the similarity metrics emanating from the topic model. In order to do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second metric measures the alignment between the graph derived from the LDA model and another obtained using metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators to select the number of topics and build persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of this kind of techniques in STI policy analysis and design.Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101004870 (IntelComp project), and has also been partially supported by FEDER/ Spanish Ministry of Science, Innovation and Universities, State Agency of Research, project TEC2017-83838-R
    • …
    corecore