
    Question-driven text summarization with extractive-abstractive frameworks

    Automatic Text Summarisation (ATS) is becoming increasingly important due to the exponential growth of textual content on the Internet. The primary goal of an ATS system is to generate a condensed version of the key aspects of the input document while minimising redundancy. ATS approaches are extractive, abstractive, or hybrid. The extractive approach selects the most important sentences in the input document(s) and concatenates them to form the summary. The abstractive approach represents the input document(s) in an intermediate form and then constructs the summary using different sentences from the originals. The hybrid approach combines the extractive and abstractive approaches. Query-based ATS selects the information that is most relevant to an initial search query. Question-driven ATS is a technique for producing concise and informative answers to specific questions using a document collection. In this thesis, a novel hybrid framework is proposed for question-driven ATS that takes advantage of both extractive and abstractive summarisation mechanisms. The framework consists of complementary modules that work together to generate an effective summary: (1) discovering appropriate non-redundant sentences as plausible answers using a multi-hop question answering system based on a Convolutional Neural Network (CNN), a multi-head attention mechanism, and a reasoning process; and (2) rewriting the extracted sentences in an abstractive setting using a novel transformer-based paraphrasing Generative Adversarial Network (GAN) model. In addition, a fusing mechanism is proposed for compressing sentence pairs selected by a next-sentence-prediction model in the paraphrased summary. Extensive experiments on various datasets are performed, and the results show the model can outperform many question-driven and query-based baseline methods. The proposed model is adaptable to generating summaries for questions in both closed and open domains. An online summariser demo based on the proposed model is provided for industrial use in processing technical text.
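The answer-extraction step described above can be illustrated with a minimal sketch. A plain bag-of-words cosine similarity stands in for the thesis's CNN, multi-head attention, and multi-hop reasoning components, and the function names and redundancy threshold here are illustrative assumptions, not the thesis's implementation.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lower-cased bag-of-words counts (punctuation stripped)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def select_answers(question, sentences, k=2, redundancy_cap=0.8):
    """Greedily pick the k sentences most similar to the question,
    skipping any sentence too similar to one already chosen
    (a simple stand-in for non-redundant answer selection)."""
    q = bow(question)
    ranked = sorted(sentences, key=lambda s: cosine(q, bow(s)), reverse=True)
    chosen = []
    for s in ranked:
        if all(cosine(bow(s), bow(c)) < redundancy_cap for c in chosen):
            chosen.append(s)
        if len(chosen) == k:
            break
    return chosen
```

In a hybrid pipeline of this shape, the sentences returned here would then be passed to the abstractive paraphrasing stage.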

    Designing coherent and engaging open-domain conversational AI systems

    Designing conversational AI systems able to engage in open-domain ‘social’ conversation is extremely challenging and a frontier of current research. Such systems are required to have extensive awareness of the dialogue context and world knowledge, as well as of the user's intents and interests, requiring more complex language understanding, dialogue management, and state and topic tracking mechanisms than traditional task-oriented dialogue systems. Given the wide coverage of topics in open-domain dialogue, the conversation can span multiple turns in which a number of complex linguistic phenomena (e.g. ellipsis and anaphora) are present and must be resolved for the system to be contextually aware. Such systems also need to be engaging, keeping the users' interest over long conversations. These are only some of the challenges that open-domain dialogue systems face. Therefore, this thesis focuses on designing dialogue systems able to hold extensive open-domain conversations in a coherent, engaging, and appropriate manner over multiple turns. First, different types of dialogue system architecture and design decisions are discussed for social open-domain conversations, along with relevant evaluation metrics. A modular architecture for ensemble-based conversational systems is presented, called Alana, a finalist in the Amazon Alexa Prize Challenge in 2017 and 2018, able to tackle many of the challenges of open-domain social conversation. The system combines features such as topic tracking, contextual natural language understanding, entity linking, user modelling, information retrieval, and response ranking, using a rich representation of dialogue state. The thesis next analyses the performance of the 2017 system and describes the upgrades developed for the 2018 system. This leads to an analysis and comparison of the real-user data collected in both years with different system configurations, allowing assessment of the impact of different design decisions and modules. Finally, Alana was integrated into an embodied robotic platform and enhanced with the ability to also perform tasks. This system was deployed and evaluated in a shopping mall in Finland. Further analysis of the added embodiment is presented and discussed, as well as the challenges of translating open-domain dialogue systems into other languages. Data analysis of the collected real-user data shows the importance of the variety of features developed and decisions made in the design of the Alana system.
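The ensemble-plus-ranking design described above can be sketched at its simplest: several bots propose candidate responses and a ranker picks one. This is not Alana's actual ranker; the features (topical overlap with the last user turn, a length bonus, a verbatim-repetition penalty) and their weights are invented purely for illustration.

```python
def rank_responses(candidates, history):
    """Pick the best candidate response by simple hand-crafted features:
    reward word overlap with the last user turn, give a small bonus for
    longer responses, and penalise verbatim repeats of earlier turns.
    Illustrative weights only; a real ranker would be far richer."""
    last_user = set(history[-1].lower().split())
    def score(resp):
        words = resp.lower().split()
        overlap = len(set(words) & last_user)     # topical relevance
        length_bonus = min(len(words), 10) * 0.1  # avoid one-word replies
        repeat_penalty = 5 if resp in history else 0
        return overlap + length_bonus - repeat_penalty
    return max(candidates, key=score)
```

A dialogue manager would call this once per turn over the candidates returned by each bot in the ensemble.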

    Automatic Text Summarization for Hindi Using Real Coded Genetic Algorithm

    In the present scenario, Automatic Text Summarization (ATS) is in great demand to address the ever-growing volume of text data available online and to discover relevant information faster. In this research, an ATS methodology is proposed for the Hindi language using a Real Coded Genetic Algorithm (RCGA) over a health corpus available in the Kaggle dataset. The methodology comprises five phases: preprocessing, feature extraction, processing, sentence ranking, and summary generation. Rigorous experimentation on varied feature sets is performed, in which distinguishing features, namely sentence similarity and named entity features, are combined with others for computing the evaluation metrics. The top 14 feature combinations are evaluated using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measure. RCGA computes appropriate feature weights through strings of features, chromosome selection, and the reproduction operators Simulated Binary Crossover and Polynomial Mutation. To extract the highest-scored sentences as the corpus summary, different compression rates are tested. In comparison with existing summarization tools, the extractive ATS method gives a summary reduction of 65%.
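The two reproduction operators named above have standard real-coded forms, sketched below for evolving feature-weight vectors. The distribution indices (`eta`) and bounds are common textbook defaults, not the thesis's settings.

```python
import random

def sbx_crossover(p1, p2, eta=2.0):
    """Simulated Binary Crossover: for each gene, draw a spread factor
    beta from a polynomial distribution controlled by eta, then form two
    children symmetric about the parents' mean (so the mean is preserved)."""
    c1, c2 = [], []
    for x1, x2 in zip(p1, p2):
        u = random.random()
        beta = (2 * u) ** (1 / (eta + 1)) if u <= 0.5 \
            else (1 / (2 * (1 - u))) ** (1 / (eta + 1))
        c1.append(0.5 * ((1 + beta) * x1 + (1 - beta) * x2))
        c2.append(0.5 * ((1 - beta) * x1 + (1 + beta) * x2))
    return c1, c2

def polynomial_mutation(genes, low=0.0, high=1.0, eta=20.0, rate=0.1):
    """Polynomial mutation: perturb each gene with probability `rate`
    by a polynomially distributed step, clamped to [low, high]."""
    out = []
    for x in genes:
        if random.random() < rate:
            u = random.random()
            delta = (2 * u) ** (1 / (eta + 1)) - 1 if u < 0.5 \
                else 1 - (2 * (1 - u)) ** (1 / (eta + 1))
            x = min(high, max(low, x + delta * (high - low)))
        out.append(x)
    return out
```

In an RCGA pipeline of this kind, each chromosome would encode one weight per sentence feature, with fitness computed from the resulting summary's evaluation score.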

    Guiding Abstractive Summarization using Structural Information

    Abstractive summarization takes a set of sentences from a source document and reproduces its salient information in the summarizer's own words. Produced summaries may contain novel words and have grammatical structures different from those of the source document. In a sense, abstractive summarization is closer to how a human summarizes, yet it is also more difficult to automate since it requires a full understanding of natural language. However, with the inception of deep learning, many new summarization systems have achieved improved automatic and manual evaluation scores. One prominent deep learning model is the sequence-to-sequence (seq2seq) model with an attention mechanism. Moreover, the advent of language models pre-trained on huge sets of unlabeled data further improved the performance of summarization systems. Despite all these improvements, abstractive summarization is still adversely affected by hallucination and disfluency. Furthermore, recent works that use a seq2seq model require a large dataset, since the underlying neural network easily overfits on a small dataset, resulting in poor approximation and high-variance outputs. The problem is that these large datasets often come with only a single reference summary for each source document, even though human annotators are known to be subject to a certain degree of subjectivity when writing a summary. We addressed the first problem by using a mechanism in which the model uses a guidance signal to control which tokens are generated. A guidance signal is additional input fed into the model alongside the source document; a commonly used one is structural information from the source document. Recent approaches showed good results with this idea; however, they used a joint-training approach for the guiding mechanism, meaning the model needs to be re-trained if a different guidance signal is used, which is costly. We propose approaches that work without re-training and are therefore more flexible with regard to the guidance signal source, as well as computationally cheaper. We performed two different experiments. The first is a novel guided mechanism that extends previous work on abstractive summarization using Abstract Meaning Representation (AMR) with a neural language generation stage, which we guide using side information. Results showed that our approach improves over a strong baseline by 2 ROUGE-2 points. The second experiment is a guided key-phrase extractor for more informative summarization. This experiment showed mixed results, but we provide an analysis of negative and positive output examples. The second problem was addressed by our proposed manual evaluation framework, called Highlight-based Reference-less Evaluation of Summarization (HighRES). The proposed framework avoids reference bias and provides absolute instead of ranked evaluation of systems. To validate our approach, we employed crowd-workers to augment the eXtreme SUMmarization (XSUM) dataset, a highly abstractive summarization dataset, with highlights. We then compared two abstractive systems (Pointer Generator and T-Conv) to demonstrate our approach. Results showed that HighRES improves inter-annotator agreement in comparison to using the source document directly, while it also emphasizes differences among systems that would be ignored under other evaluation approaches. Our work also produces an annotated dataset, which gives more insight into how humans select salient information from a source document.
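ROUGE-2, the metric behind the "2 ROUGE-2 points" result above, scores clipped bigram overlap between a candidate and a reference summary. A minimal sketch (unstemmed, single-reference; real ROUGE toolkits add stemming and multi-reference handling):

```python
from collections import Counter

def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2(candidate, reference):
    """ROUGE-2 precision, recall, and F1: overlapping bigram counts,
    clipped per bigram, between candidate and reference summaries."""
    c = bigrams(candidate.lower().split())
    r = bigrams(reference.lower().split())
    overlap = sum(min(c[g], r[g]) for g in c)
    prec = overlap / max(sum(c.values()), 1)
    rec = overlap / max(sum(r.values()), 1)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

A "point" in reported results is one percentage point of this score, usually of the F1 variant.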

    A Comparative Study of Text Summarization on E-mail Data Using Unsupervised Learning Approaches

    Over the last few years, email has become enormously popular. People send and receive many messages every day, connect with colleagues and friends, and share files and information. Unfortunately, email overload has developed into a personal problem for users as well as a financial concern for businesses. Accessing an ever-increasing number of lengthy emails has become a major concern for many users, and email text summarization is a promising approach to resolving this challenge. Email messages are general-domain text, unstructured and not always well developed syntactically. Such elements introduce challenges for text processing research, especially for the task of summarization. This research employs quantitative and inductive methodologies to implement unsupervised learning models that address the summarization task, to efficiently generate more precise summaries, and to determine which approach to unsupervised clustering performs best. The precision score from the ROUGE-N metric is used for evaluation. This research evaluates the precision of four different approaches to text summarization, using various combinations of feature embedding techniques (Word2Vec and BERT) and hybrid or conventional clustering algorithms. The results reveal that both the Word2Vec and the BERT feature embeddings, combined with the hybrid PHA-ClusteringGain k-means algorithm, achieved an increase in precision compared with the conventional k-means clustering model. Among the hybrid approaches, the one using Word2Vec as the feature embedding method attained the maximum precision value of 55.73%.
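The clustering-based extraction common to all four approaches can be sketched as follows, with a toy stdlib k-means in place of the hybrid PHA-ClusteringGain variant, and precomputed vectors standing in for Word2Vec/BERT embeddings. All names and the toy data are illustrative assumptions.

```python
import math
import random

def kmeans(vecs, k, iters=20, seed=0):
    """Toy Lloyd's k-means over lists of floats; returns final centroids."""
    random.seed(seed)
    centroids = random.sample(vecs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:
            j = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            clusters[j].append(v)
        centroids = [
            [sum(d) / len(c) for d in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

def summarise(sentences, embed, k=2):
    """Cluster sentence embeddings, then keep the sentence closest to
    each centroid as one line of the extractive summary."""
    vecs = [embed(s) for s in sentences]
    return [min(sentences, key=lambda s: math.dist(embed(s), c))
            for c in kmeans(vecs, k)]
```

Each centroid represents one topic in the mailbox, so the summary contains one representative sentence per topic.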

    Query-based summarization using reinforcement learning and transformer model

    Query-based summarization is an interesting problem in the text summarization field. Meanwhile, reinforcement learning, a technique popular in robotics, has become accessible for the text summarization problem in the last couple of years (Narayan et al., 2018). The lack of significant work using reinforcement learning to solve the query-based summarization problem inspired us to use this technique. In doing so, we also introduce a different approach to sentence ranking and clustering to avoid redundancy in summaries. We propose an unsupervised extractive summarization method that provides state-of-the-art results on some metrics. We develop two abstractive multi-document summarization models using reinforcement learning and the transformer model (Vaswani et al., 2017). We consider the importance of information coverage and diversity under a fixed sentence limit for our summarization models. We have performed several experiments with our proposed models, which yield significant results across different evaluation metrics.
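The coverage/diversity trade-off under a fixed sentence limit mentioned above is classically captured by Maximal Marginal Relevance (MMR). The sketch below uses bag-of-words cosine as a stand-in for the thesis's learned models; the trade-off weight `lam` and the function names are illustrative assumptions.

```python
import math
from collections import Counter

def _vec(s):
    return Counter(s.lower().split())

def _cos(a, b):
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def mmr_select(query, sentences, limit=3, lam=0.5):
    """Maximal Marginal Relevance: greedily trade relevance to the query
    (coverage) against similarity to already-chosen sentences (diversity),
    stopping at a fixed sentence limit."""
    q = _vec(query)
    chosen, pool = [], list(sentences)
    while pool and len(chosen) < limit:
        def mmr(s):
            rel = _cos(_vec(s), q)
            red = max((_cos(_vec(s), _vec(c)) for c in chosen), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(pool, key=mmr)
        chosen.append(best)
        pool.remove(best)
    return chosen
```

With `lam` near 1 the selection is purely relevance-driven; lowering it pushes the summary toward covering distinct information rather than restating the most relevant sentence.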