173 research outputs found

Beyond Extractive: Advancing Abstractive Automatic Text Summarization in Norwegian with Transformers

Author: Korsvik Jon-Mikkel Ryen
Navjord Jørgen Johnsen
Publication venue: Norwegian University of Life Sciences
Publication date: 01/01/2023

Automatic summarization is a key area in natural language processing (NLP) and machine learning which attempts to generate informative summaries of articles and documents. Despite its evolution since the 1950s, research on automatically summarising Norwegian text has remained relatively underdeveloped. Though there have been some strides made in extractive systems, which generate summaries by selecting and condensing key phrases directly from the source material, the field of abstractive summarization remains unexplored for the Norwegian language. Abstractive summarization is distinct as it generates summaries incorporating new words and phrases not present in the original text. This Master's thesis revolves around one key question: Is it possible to create a machine learning system capable of performing abstractive summarization in Norwegian? To answer this question, we generate and release the first two Norwegian datasets for creating and evaluating Norwegian summarization models. One of these datasets is a web scrape of Store Norske Leksikon (SNL), and the other is a machine-translated version of CNN/Daily Mail. Using these datasets, we fine-tune two Norwegian T5 language models with 580M and 1.2B parameters to create summaries. To assess the quality of the models, we employ both automatic ROUGE scores and human evaluations of the generated summaries. To better understand the models' behaviour, we measure how a model generates summaries with various metrics, including our own novel contribution, "Match Ratio", which measures sentence similarities between summaries and articles based on Levenshtein distances. The top-performing models achieved ROUGE-1 scores of 35.07 and 34.02 on SNL and CNN/DM, respectively. In terms of human evaluation, the best model yielded an average score of 3.96/5.00 for SNL and 4.64/5.00 for CNN/Daily Mail across various criteria.
Based on these results, we conclude that it is possible to perform abstractive summarization of Norwegian with high-quality summaries. With this research, we have laid a foundation that hopefully will facilitate future research, empowering others to build upon our findings and contribute further to the development of Norwegian summarization models.
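
The abstract above mentions a sentence-similarity measure built on Levenshtein distances but does not give its definition. The following is only a minimal sketch of how such a comparison could look, not the thesis's actual "Match Ratio": the function names, the normalization by the longer string, and the best-match scoring are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match_scores(summary_sentences, article_sentences):
    """For each summary sentence, the similarity to its closest article sentence."""
    return [max((similarity(s, t) for t in article_sentences), default=0.0)
            for s in summary_sentences]
```

Under these assumptions, a score near 1.0 would suggest a summary sentence copied almost verbatim from the article (extractive behaviour), while lower scores would suggest rephrasing.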

Brage NMBU

AI approaches to understand human deceptions, perceptions, and perspectives in social media

Author: Li Chih-Yuan
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/2023

Social media platforms have created a virtual space for sharing user-generated information and for connecting and interacting among users. However, there are research and societal challenges: 1) users generate and share disinformation; 2) it is difficult to understand citizens' perceptions or opinions expressed on a wide variety of topics; and 3) information overload and echo chamber problems arise without an overall understanding of the different perspectives taken by different people or groups. This dissertation addresses these three research challenges with advanced AI and Machine Learning approaches. To address fake news, as deceptions on the facts, this dissertation presents Machine Learning approaches for fake news detection models, and a hybrid method for topic identification, whether they are fake or real. To understand users' perceptions or attitudes toward some topics, this study analyzes the sentiments expressed in social media text. The sentiment analysis of posts can be used as an indicator to measure how topics are perceived by the users and how their perceptions as a whole can affect decision makers in government and industry, especially during the COVID-19 pandemic. It is difficult to measure the public perception of government policies issued during the pandemic. The citizen responses to the government policies are diverse, ranging from security or goodwill to confusion, fear, or anger. This dissertation provides a near real-time approach to track and monitor public reactions toward government policies by continuously collecting and analyzing Twitter posts about the COVID-19 pandemic. To address social media's overwhelming number of posts, content echo chambers, and information isolation, this dissertation provides a multiple view-based summarization framework where the same contents can be summarized according to different perspectives.
This framework includes components for choosing the perspectives and advanced text summarization approaches. The proposed approaches in this dissertation are demonstrated with a prototype system that continuously collects Twitter data about COVID-19 government health policies and provides analysis of citizen concerns toward the policies; the data is also analyzed for fake news detection and for generating multiple-view summaries.

Digital Commons @ New Jersey Institute of Technology (NJIT)

Approximate Inference for Determinantal Point Processes

Author: Gillenwater Jennifer Ann
Publication venue: ScholarlyCommons
Publication date: 01/01/2014

In this thesis we explore a probabilistic model that is well-suited to a variety of subset selection tasks: the determinantal point process (DPP). DPPs were originally developed in the physics community to describe the repulsive interactions of fermions. More recently, they have been applied to machine learning problems such as search diversification and document summarization, which can be cast as subset selection tasks. A challenge, however, is scaling such DPP-based methods to the size of the datasets of interest to this community, and developing approximations for DPP inference tasks whose exact computation is prohibitively expensive. A DPP defines a probability distribution over all subsets of a ground set of items. Consider the inference tasks common to probabilistic models, which include normalizing, marginalizing, conditioning, sampling, estimating the mode, and maximizing likelihood. For DPPs, exactly computing the quantities necessary for the first four of these tasks requires time cubic in the number of items or features of the items. In this thesis, we propose a means of making these four tasks tractable even in the realm where the number of items and the number of features are large. Specifically, we analyze the impact of randomly projecting the features down to a lower-dimensional space and show that the variational distance between the resulting DPP and the original is bounded. In addition to expanding the circumstances in which these first four tasks are tractable, we also tackle the other two tasks, the first of which is known to be NP-hard (with no PTAS) and the second of which is conjectured to be NP-hard. For mode estimation, we build on submodular maximization techniques to develop an algorithm with a multiplicative approximation guarantee. For likelihood maximization, we exploit the generative process associated with DPP sampling to derive an expectation-maximization (EM) algorithm.
We experimentally verify the practicality of all the techniques that we develop, testing them on applications such as news and research summarization, political candidate comparison, and product recommendation.
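
To make the phrase "a probability distribution over all subsets" concrete, here is a minimal sketch of the standard L-ensemble form of a DPP, in which a subset Y has probability det(L_Y) / det(L + I). The kernel values and function names below are illustrative assumptions and are not taken from the thesis. The determinant is computed by Gaussian elimination, which takes time cubic in the matrix size, in line with the cubic cost the abstract mentions for exact computation.

```python
from itertools import combinations


def det(m):
    """Determinant via Gaussian elimination with partial pivoting (cubic time)."""
    n = len(m)
    if n == 0:
        return 1.0  # determinant of the empty matrix, used for the empty subset
    a = [[float(x) for x in row] for row in m]
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def dpp_probability(L, subset):
    """P(Y) = det(L_Y) / det(L + I) for an L-ensemble DPP with kernel L."""
    idx = sorted(subset)
    sub = [[L[i][j] for j in idx] for i in idx]
    n = len(L)
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return det(sub) / det(L_plus_I)


# Hypothetical kernel: items 0 and 1 are very similar, item 2 is unrelated to both.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

all_subsets = [s for k in range(len(L) + 1) for s in combinations(range(len(L)), k)]
total = sum(dpp_probability(L, s) for s in all_subsets)
```

With this kernel, the similar pair {0, 1} receives a lower probability than the dissimilar pair {0, 2}, which illustrates the repulsive behaviour that makes DPPs suited to selecting diverse subsets. Enumerating every subset, as done for `total`, is only feasible for tiny ground sets.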

ScholarlyCommons@Penn

Social Network Extraction from Text

Author: Agarwal Apoorv
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2016

In the pre-digital age, when electronically stored information was non-existent, the only ways of creating representations of social networks were by hand through surveys, interviews, and observations. In this digital age of the internet, numerous indications of social interactions and associations are available electronically in an easy-to-access manner as structured meta-data. This lessens our dependence on manual surveys and interviews for creating and studying social networks. However, there are sources of networks that remain untouched simply because they are not associated with any meta-data. Primary examples of such sources include the vast amounts of literary texts, news articles, content of emails, and other forms of unstructured and semi-structured texts. The main contribution of this thesis is the introduction of natural language processing and applied machine learning techniques for uncovering social networks in such sources of unstructured and semi-structured texts. Specifically, we propose three novel techniques for mining social networks from three types of texts: unstructured texts (such as literary texts), emails, and movie screenplays. For each of these types of texts, we demonstrate the utility of the extracted networks on three applications (one for each type of text).

Columbia University Academic Commons

Proceedings of the First Workshop on Computing News Storylines (CNewsStory 2015)

Author: ATSERIAS Jordi
BALAHUR-DOBRESCU ALEXANDRA
CASELLI Tommaso
FINLAYSON Mark
MILLER Ben
MINARD Anne-Lyse
VAN ERP Marieke
VOSSEN Piek
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 29/06/2015

This volume contains the proceedings of the 1st Workshop on Computing News Storylines (CNewsStory 2015) held in conjunction with the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015) at the China National Convention Center in Beijing, on July 31st 2015. Narratives are at the heart of information sharing. Ever since people began to share their experiences, they have connected them to form narratives. The study of storytelling and the field of literary theory called narratology have developed complex frameworks and models related to various aspects of narrative such as plot structures, narrative embeddings, characters’ perspectives, reader response, point of view, narrative voice, narrative goals, and many others. These notions from narratology have been applied mainly in Artificial Intelligence and to model formal semantic approaches to narratives (e.g. Plot Units developed by Lehnert (1981)). In recent years, computational narratology has qualified as an autonomous field of study and research. Narrative has been the focus of a number of workshops and conferences (AAAI Symposia, Interactive Storytelling Conference (ICIDS), Computational Models of Narrative). Furthermore, reference annotation schemes for narratives have been proposed (NarrativeML by Mani (2013)). The workshop aimed at bringing together researchers from different communities working on representing and extracting narrative structures in news, a text genre which is highly used in NLP but which has received little attention with respect to narrative structure, representation and analysis. Currently, advances in NLP technology have made it feasible to look beyond scenario-driven, atomic extraction of events from single documents and work towards extracting story structures from multiple documents, while these documents are published over time as news streams.
Policy makers, NGOs, information specialists (such as journalists and librarians) and others are increasingly in need of tools that support them in finding salient stories in large amounts of information to more effectively implement policies, monitor actions of “big players” in the society and check facts. Their tasks often revolve around reconstructing cases either with respect to specific entities (e.g. persons or organizations) or events (e.g. Hurricane Katrina). Storylines represent explanatory schemas that enable us to make better selections of relevant information but also projections to the future. They form a valuable potential for exploiting news data in an innovative way.

JRC Publications Repository

Comparative study of NER using Bi-LSTM-CRF with different word vectorisation techniques on DNB documents

Author: Joseph Meera
Publication venue: Norwegian University of Life Sciences, Ås
Publication date: 01/01/2021

The presence of huge volumes of unstructured data in the form of PDF documents poses a challenge to organizations trying to extract valuable information from it. In this thesis, we try to solve this problem as per the requirement of DNB by building an automatic information extraction system that retrieves only the key information in which the company is interested from the PDF documents. This is achieved by comparing the performance of named entity recognition models for automatic text extraction, built using Bi-directional Long Short Term Memory (Bi-LSTM) with a Conditional Random Field (CRF) in combination with three variations of word vectorisation techniques. The word vectorisation techniques compared in this thesis include randomly generated word embeddings by the Keras embedding layer, pre-trained static word embeddings focusing on 100-dimensional GloVe embeddings and, finally, deep-contextual ELMo word embeddings. Comparison of these models helps us identify the advantages and disadvantages of using different word embeddings by analysing their effect on NER performance. This study was performed on a DNB-provided data set. The comparative study showed that the NER systems built using Bi-LSTM-CRF with GloVe embeddings gave the best results with a micro F1 score of 0.868 and a macro F1 score of 0.872 on unseen data, in comparison to a Bi-LSTM-CRF based NER using Keras embedding layer and ELMo embeddings which gave micro F1 scores of 0.858 and 0.796 and macro F1 scores of 0.848 and 0.776 respectively. The result is contrary to our assumption that NER using deep contextualised word embeddings shows better performance when compared to NER using other word embeddings. We proposed that this contradicting performance is due to the high dimensionality, and we analysed it by using a lower-dimensional word embedding.
It was found that using 50-dimensional GloVe embeddings instead of 100-dimensional GloVe embeddings resulted in an improvement of the overall micro and macro F1 score from 0.87 to 0.88. Additionally, optimising the best model, which was the Bi-LSTM-CRF using 100-dimensional GloVe embeddings, by tuning in a small hyperparameter search space did not result in any improvement from the present micro F1 score of 0.87 and macro F1 score of 0.87.
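
The abstract reports both micro and macro F1 scores. As a reminder of how the two averages differ, the sketch below computes them from per-class counts of true positives, false positives, and false negatives. The entity labels and counts are invented for illustration and have no connection to the DNB data set or the reported results.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator


def micro_f1(counts: dict) -> float:
    """Pool the counts of all classes first, then compute a single F1."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


def macro_f1(counts: dict) -> float:
    """Compute F1 per class, then take the unweighted mean over classes."""
    scores = [f1_from_counts(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical per-entity counts: one frequent, well-predicted class and one rare, weak class.
example_counts = {
    "PER": {"tp": 8, "fp": 2, "fn": 2},
    "ORG": {"tp": 1, "fp": 1, "fn": 3},
}

micro = micro_f1(example_counts)
macro = macro_f1(example_counts)
```

In this invented example the micro average is higher than the macro average because the frequent class dominates the pooled counts, whereas the macro average gives the rare class equal weight.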

Brage NMBU

Document summarization with neural query modeling

Author: Xu Yumo
Publication venue: The University of Edinburgh
Publication date: 16/12/2022

Document summarization is a natural language processing task that aims to produce a short summary that concisely delivers the most important information of a document or multiple documents. Over the last few decades, the task has drawn much attention from both academia and industry, as it provides effective tools to manage and access text information. For example, through a newswire summarization engine, users can quickly digest a cluster of news articles by reading a short summary of the topic. Such summaries can, meanwhile, be used by news recommendation and question answering engines. Depending on the users’ role in the summarization process, document summarization falls into two broad categories: generic summarization and query focused summarization (QFS). The former focuses on information intrinsically salient in the input text, while the latter also caters to requests explicitly specified by users. Despite the difference between generic summarization and QFS in their task formulations, we argue that all summaries address queries, even if they are not formulated explicitly. In this thesis, we introduce query modeling in the document summarization context as a critical objective for incorporating observed or latent user intent. We investigate different approaches that explore this theme with deep neural networks. We develop novel systems with neural query modeling for both extractive summarization, where summaries are composed of salient segments (e.g., sentences) from the original document(s), and abstractive summarization, where summaries are made up of words or phrases that do not exist in the input. The recent availability of large-scale datasets has driven the development of neural models that create generic summaries. However, training data in the form of queries, documents, and summaries for QFS is scarce. 
As most existing research in QFS has employed an extractive approach, we first consider better modeling query-cluster interactions for low-resource extractive QFS. In contrast to previous work with retrieval-style methods for assembling query-relevant summaries, we propose a framework that progressively estimates whether text segments should be included in the summary. Notably, modules of this framework can be independently developed and can leverage training data if available. We present an instantiation of this framework with distant supervision from question answering where various resources exist to identify segments which are likely to answer the query. Experiments on benchmark datasets show that our framework achieves competitive results and is robust across domains. Ideally, summaries should be abstracts, and the hidden costs incurred by annotating QA pairs should be avoided in query modeling. The second part of this thesis focuses on the low-resource challenge in abstractive QFS, and builds an abstractive QFS system which is trained query-free. Concretely, we propose to decompose the task into query modeling and conditional language modeling. For query modeling, we first introduce a unified representation for summaries and queries to exploit training resources in generic summarization, on top of which a weakly supervised model is optimized for evidence estimation. The proposed framework achieves state-of-the-art performance in generating query focused abstracts across existing benchmarks. Finally, the third part of this thesis moves beyond QFS. We provide a unified modeling framework for any kind of summarization, under the assumption that all summaries are a response to a query, which is observed in the case of QFS and latent in the case of generic summarization. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations.
Requiring no further optimization on downstream summarization tasks, experiments show that our approach outperforms strong comparison systems across benchmarks, query types, document settings, and target domains.

Edinburgh Research Archive