14 research outputs found

    HTSS: A novel hybrid text summarisation and simplification architecture

    Get PDF
    Text simplification and text summarisation are related, but different sub-tasks in Natural Language Generation. Whereas summarisation attempts to reduce the length of a document, whilst keeping the original meaning, simplification attempts to reduce the complexity of a document. In this work, we combine both tasks of summarisation and simplification using a novel hybrid architecture of abstractive and extractive summarisation called HTSS. We extend the well-known pointer generator model for the combined task of summarisation and simplification. We have collected our parallel corpus from the simplified summaries written by domain experts published on the science news website EurekaAlert (www.eurekalert.org). Our results show that our proposed HTSS model outperforms neural text simplification (NTS) on SARI score and abstractive text summarisation (ATS) on the ROUGE score. We further introduce a new metric (CSS1) which combines SARI and Rouge and demonstrates that our proposed HTSS model outperforms NTS and ATS on the joint task of simplification and summarisation by 38.94% and 53.40%, respectively. We provide all code, models and corpora to the scientific community for future research at the following URL: https://github.com/slab-itu/HTSS/

    Principled Approaches to Automatic Text Summarization

    Get PDF
    Automatic text summarization is a particularly challenging Natural Language Processing (NLP) task involving natural language understanding, content selection and natural language generation. In this thesis, we concentrate on the content selection aspect, the inherent problem of summarization which is controlled by the notion of information Importance. We present a simple and intuitive formulation of the summarization task as two components: a summary scoring function θ measuring how good a text is as a summary of the given sources, and an optimization technique O extracting a summary with a high score according to θ. This perspective offers interesting insights over previous summarization efforts and allows us to pinpoint promising research directions. In particular, we realize that previous works heavily constrained the summary scoring function in order to solve convenient optimization problems (e.g., Integer Linear Programming). We question this assumption and demonstrate that General Purpose Optimization (GPO) techniques like genetic algorithms are practical. These GPOs do not require mathematical properties from the objective function and, thus, the summary scoring function can be relieved from its previously imposed constraints. Additionally, the summary scoring function can be evaluated on its own based on its ability to correlate with humans. This offers a principled way of examining the inner workings of summarization systems and complements the traditional evaluations of the extracted summaries. In fact, evaluation metrics are also summary scoring functions which should correlate well with humans. Thus, the two main challenges of summarization, the evaluation and the development of summarizers, are unified within the same setup: discovering strong summary scoring functions. Hence, we investigated ways of uncovering such functions. First, we conducted an empirical study of learning the summary scoring function from data. The results show that an unconstrained summary scoring function is better able to correlate with humans. Furthermore, an unconstrained summary scoring function optimized approximately with GPO extracts better summaries than a constrained summary scoring function optimized exactly with, e.g., ILP. Along the way, we proposed techniques to leverage the small and biased human judgment datasets. Additionally, we released a new evaluation metric explicitly trained to maximize its correlation with humans. Second, we developed a theoretical formulation of the notion of Importance. In a framework rooted in information theory, we defined the quantities: Redundancy, Relevance and Informativeness. Importance arises as the notion unifying these concepts. More generally, Importance is the measure that guides which choices to make when information must be discarded. Finally, evaluation remains an open-problem with a massive impact on summarization progress. Thus, we conducted experiments on available human judgment datasets commonly used to compare evaluation metrics. We discovered that these datasets do not cover the high-quality range in which summarization systems and evaluation metrics operate. This motivates efforts to collect human judgments for high-scoring summaries as this would be necessary to settle the debate over which metric to use. This would also be greatly beneficial for improving summarization systems and metrics alike

    Document analysis by means of data mining techniques

    Get PDF
    The huge amount of textual data produced everyday by scientists, journalists and Web users, allows investigating many different aspects of information stored in the published documents. Data mining and information retrieval techniques are exploited to manage and extract information from huge amount of unstructured textual data. Text mining also known as text data mining is the processing of extracting high quality information (focusing relevance, novelty and interestingness) from text by identifying patterns etc. Text mining typically involves the process of structuring input text by means of parsing and other linguistic features or sometimes by removing extra data and then finding patterns from structured data. Patterns are then evaluated at last and interpretation of output is performed to accomplish the desired task. Recently, text mining has got attention in several fields such as in security (involves analysis of Internet news), for commercial (for search and indexing purposes) and in academic departments (such as answering query). Beyond searching the documents consisting the words given in a user query, text mining may provide direct answer to user by semantic web for content based (content meaning and its context). It can also act as intelligence analyst and can also be used in some email spam filters for filtering out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling and document summarization. In particular, summarization approaches are suitable for identifying relevant sentences that describe the main concepts presented in a document dataset. Furthermore, the knowledge existed in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents. Some of them are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as Hidden Markov models, neural networks and Naive Bayes methods. An appealing research field is the extraction of summaries tailored to the major user interests. In this context, the problem of extracting useful information according to domain knowledge related to the user interests is a challenging task. The main topics have been to study and design of novel data representations and data mining algorithms useful for managing and extracting knowledge from unstructured documents. This thesis describes an effort to investigate the application of data mining approaches, firmly established in the subject of transactional data (e.g., frequent itemset mining), to textual documents. Frequent itemset mining is a widely exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well-established, the usage of frequent itemsets in textual document summarization has never been investigated so far. A work is carried on exploiting frequent itemsets for the purpose of multi-document summarization so a novel multi-document summarizer, namely ItemSum (Itemset-based Summarizer) is presented, that is based on an itemset-based model, i.e., a framework comprise of frequent itemsets, taken out from the document collection. Highly representative and not redundant sentences are selected for generating summary by considering both sentence coverage, with respect to a sentence relevance score, based on tf-idf statistics, and a concise and highly informative itemset-based model. To evaluate the ItemSum performance a suite of experiments on a collection of news articles has been performed. Obtained results show that ItemSum significantly outperforms mostly used previous summarizers in terms of precision, recall, and F-measure. We also validated our approach against a large number of approaches on the DUC’04 document collection. Performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on the summarization performance are investigated as well. In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not cover in an effective way all the semantically relevant data facets. A step beyond towards the generation of more accurate summaries has been made by semantics-based summarizers. Such approaches combine the use of general-purpose summarization strategies with ad-hoc linguistic analysis. The key idea is to also consider the semantics behind the document content to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process. Therefore, the generated summaries could not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating the ontology-based document analysis into the summarization process in order to take the semantic meaning of the document content into account during the sentence evaluation and selection processes. With this in mind, we propose a new multi-document summarizer, namely Yago-based Summarizer, that integrates an established ontology-based entity recognition and disambiguation step. Named Entity Recognition from Yago ontology is being used for the task of text summarization. The Named Entity Recognition (NER) task is concerned with marking occurrences of a specific object being mentioned. These mentions are then classified into a set of predefined categories. Standard categories include “person”, “location”, “geo-political organization”, “facility”, “organization”, and “time”. The use of NER in text summarization improved the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC’04 benchmark document collections with that of a large number of state-of-the-art summarizers. Furthermore, we also performed a qualitative evaluation of the soundness and readability of the generated summaries and a comparison with the results that were produced by the most effective summarizers. A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named as SociONewSum. The effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles ranging over the same topic, the goal is to extract a concise yet informative summary, which consists of most salient document sentences. An established ontological model has been used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of the user-generated content coming from Twitter has been exploited to discover current social trends and improve the appealing of the generated summaries. An experimental evaluation of the SociONewSum performance was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open source summarizers as well as to a baseline version of the SociONewSum summarizer that does not perform any UGC analysis. Furthermore, the readability of the generated summaries has also been analyzed

    Question-driven text summarization with extractive-abstractive frameworks

    Get PDF
    Automatic Text Summarisation (ATS) is becoming increasingly important due to the exponential growth of textual content on the Internet. The primary goal of an ATS system is to generate a condensed version of the key aspects in the input document while minimizing redundancy. ATS approaches are extractive, abstractive, or hybrid. The extractive approach selects the most important sentences in the input document(s) and then concatenates them to form the summary. The abstractive approach represents the input document(s) in an intermediate form and then constructs the summary using different sentences than the originals. The hybrid approach combines both the extractive and abstractive approaches. The query-based ATS selects the information that is most relevant to the initial search query. Question-driven ATS is a technique to produce concise and informative answers to specific questions using a document collection. In this thesis, a novel hybrid framework is proposed for question-driven ATS taking advantage of extractive and abstractive summarisation mechanisms. The framework consists of complementary modules that work together to generate an effective summary: (1) discovering appropriate non-redundant sentences as plausible answers using a multi-hop question answering system based on a Convolutional Neural Network (CNN), multi-head attention mechanism and reasoning process; and (2) a novel paraphrasing Generative Adversarial Network (GAN) model based on transformers rewrites the extracted sentences in an abstractive setup. In addition, a fusing mechanism is proposed for compressing the sentence pairs selected by a next sentence prediction model in the paraphrased summary. Extensive experiments on various datasets are performed, and the results show the model can outperform many question-driven and query-based baseline methods. The proposed model is adaptable to generate summaries for the questions in the closed domain and open domain. An online summariser demo is designed based on the proposed model for the industry use to process the technical text

    A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4

    Full text link
    Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation. LLMs, because of their large size and pretraining on large volumes of text data, exhibit special abilities which allow them to achieve remarkable performances without any task-specific training in many of the natural language processing tasks. The era of LLMs started with OpenAI GPT-3 model, and the popularity of LLMs is increasing exponentially after the introduction of models like ChatGPT and GPT4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey which summarizes the recent research progress in multiple dimensions and can guide the research community with insightful future research directions. We start the survey paper with foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. We then present a brief overview of GLLMs and discuss the performances of GLLMs in various downstream tasks, specific domains and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, the robustness of GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with multiple insightful future research directions. To summarize, this comprehensive survey paper will serve as a good resource for both academic and industry people to stay updated with the latest research related to GPT-3 family large language models.Comment: Preprint under review, 58 page

    Toward summarization of communicative activities in spoken conversation

    Get PDF
    This thesis is an inquiry into the nature and structure of face-to-face conversation, with a special focus on group meetings in the workplace. I argue that conversations are composed of episodes, each of which corresponds to an identifiable communicative activity such as giving instructions or telling a story. These activities are important because they are part of participants’ commonsense understanding of what happens in a conversation. They appear in natural summaries of conversations such as meeting minutes, and participants talk about them within the conversation itself. Episodic communicative activities therefore represent an essential component of practical, commonsense descriptions of conversations. The thesis objective is to provide a deeper understanding of how such activities may be recognized and differentiated from one another, and to develop a computational method for doing so automatically. The experiments are thus intended as initial steps toward future applications that will require analysis of such activities, such as an automatic minute-taker for workplace meetings, a browser for broadcast news archives, or an automatic decision mapper for planning interactions. My main theoretical contribution is to propose a novel analytical framework called participant relational analysis. The proposal argues that communicative activities are principally indicated through participant-relational features, i.e., expressions of relationships between participants and the dialogue. Participant-relational features, such as subjective language, verbal reference to the participants, and the distribution of speech activity amongst the participants, are therefore argued to be a principal means for analyzing the nature and structure of communicative activities. I then apply the proposed framework to two computational problems: automatic discourse segmentation and automatic discourse segment labeling. The first set of experiments test whether participant-relational features can serve as a basis for automatically segmenting conversations into discourse segments, e.g., activity episodes. Results show that they are effective across different levels of segmentation and different corpora, and indeed sometimes more effective than the commonly-used method of using semantic links between content words, i.e., lexical cohesion. They also show that feature performance is highly dependent on segment type, suggesting that human-annotated “topic segments” are in fact a multi-dimensional, heterogeneous collection of topic and activity-oriented units. Analysis of commonly used evaluation measures, performed in conjunction with the segmentation experiments, reveals that they fail to penalize substantially defective results due to inherent biases in the measures. I therefore preface the experiments with a comprehensive analysis of these biases and a proposal for a novel evaluation measure. A reevaluation of state-of-the-art segmentation algorithms using the novel measure produces substantially different results from previous studies. This raises serious questions about the effectiveness of some state-of-the-art algorithms and helps to identify the most appropriate ones to employ in the subsequent experiments. I also preface the experiments with an investigation of participant reference, an important type of participant-relational feature. I propose an annotation scheme with novel distinctions for vagueness, discourse function, and addressing-based referent inclusion, each of which are assessed for inter-coder reliability. The produced dataset includes annotations of 11,000 occasions of person-referring. The second set of experiments concern the use of participant-relational features to automatically identify labels for discourse segments. In contrast to assigning semantic topic labels, such as topical headlines, the proposed algorithm automatically labels segments according to activity type, e.g., presentation, discussion, and evaluation. The method is unsupervised and does not learn from annotated ground truth labels. Rather, it induces the labels through correlations between discourse segment boundaries and the occurrence of bracketing meta-discourse, i.e., occasions when the participants talk explicitly about what has just occurred or what is about to occur. Results show that bracketing meta-discourse is an effective basis for identifying some labels automatically, but that its use is limited if global correlations to segment features are not employed. This thesis addresses important pre-requisites to the automatic summarization of conversation. What I provide is a novel activity-oriented perspective on how summarization should be approached, and a novel participant-relational approach to conversational analysis. The experimental results show that analysis of participant-relational features is

    A semantic metadata enrichment software ecosystem (SMESE) : its prototypes for digital libraries, metadata enrichments and assisted literature reviews

    Get PDF
    Contribution 1: Initial design of a semantic metadata enrichment ecosystem (SMESE) for Digital Libraries The Semantic Metadata Enrichments Software Ecosystem (SMESE V1) for Digital Libraries (DLs) proposed in this paper implements a Software Product Line Engineering (SPLE) process using a metadata-based software architecture approach. It integrates a components-based ecosystem, including metadata harvesting, text and data mining and machine learning models. SMESE V1 is based on a generic model for standardizing meta-entity metadata and a mapping ontology to support the harvesting of various types of documents and their metadata from the web, databases and linked open data. SMESE V1 supports a dynamic metadata-based configuration model using multiple thesauri. The proposed model defines rules-based crosswalks that create pathways to different sources of data and metadata. Each pathway checks the metadata source structure and performs data and metadata harvesting. SMESE V1 proposes a metadata model in six categories of metadata instead of the four currently proposed in the literature for DLs; this makes it possible to describe content by defined entity, thus increasing usability. In addition, to tackle the issue of varying degrees of depth, the proposed metadata model describes the most elementary aspects of a harvested entity. A mapping ontology model has been prototyped in SMESE V1 to identify specific text segments based on thesauri in order to enrich content metadata with topics and emotions; this mapping ontology also allows interoperability between existing metadata models. Contribution 2: Metadata enrichments ecosystem based on topics and interests The second contribution extends the original SMESE V1 proposed in Contribution 1. Contribution 2 proposes a set of topic- and interest-based content semantic enrichments. The improved prototype, SMESE V3 (see following figure), uses text analysis approaches for sentiment and emotion detection and provides machine learning models to create a semantically enriched repository, thus enabling topic- and interest-based search and discovery. SMESE V3 has been designed to find short descriptions in terms of topics, sentiments and emotions. It allows efficient processing of large collections while keeping the semantic and statistical relationships that are useful for tasks such as: 1. topic detection, 2. contents classification, 3. novelty detection, 4. text summarization, 5. similarity detection. Contribution 3: Metadata-based scientific assisted literature review The third contribution proposes an assisted literature review (ALR) prototype, STELLAR V1 (Semantic Topics Ecosystem Learning-based Literature Assisted Review), based on machine learning models and a semantic metadata ecosystem. Its purpose is to identify, rank and recommend relevant papers for a literature review (LR). This third prototype can assist researchers, in an iterative process, in finding, evaluating and annotating relevant papers harvested from different sources and input into the SMESE V3 platform, available at any time. The key elements and concepts of this prototype are: 1. text and data mining, 2. machine learning models, 3. classification models, 4. researchers annotations, 5. semantically enriched metadata. STELLAR V1 helps the researcher to build a list of relevant papers according to a selection of metadata related to the subject of the ALR. The following figure presents the model, the related machine learning models and the metadata ecosystem used to assist the researcher in the task of producing an ALR on a specific topic

    B!SON: A Tool for Open Access Journal Recommendation

    Get PDF
    Finding a suitable open access journal to publish scientific work is a complex task: Researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of Predatory Publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. It is developed based on a systematic requirements analysis, built on open data, gives publisher-independent recommendations and works across domains. It suggests open access journals based on title, abstract and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)

    Get PDF
    corecore