564 research outputs found
Beyond Extractive: Advancing Abstractive Automatic Text Summarization in Norwegian with Transformers
Automatic summarization is a key area of natural language processing (NLP) and machine learning that aims to generate informative summaries of articles and documents. Despite its evolution since the 1950s, research on automatically summarizing Norwegian text has remained relatively underdeveloped. Though some strides have been made in extractive systems, which generate summaries by selecting and condensing key phrases directly from the source material, abstractive summarization remains unexplored for the Norwegian language. Abstractive summarization is distinct in that it generates summaries incorporating new words and phrases not present in the original text.
This Master's thesis revolves around one key question: Is it possible to create a machine learning system capable of performing abstractive summarization in Norwegian? To answer this question, we generate and release the first two Norwegian datasets for creating and evaluating Norwegian summarization models. One of these datasets is a web scrape of Store Norske Leksikon (SNL), and the other is a machine-translated version of CNN/Daily Mail. Using these datasets, we fine-tune two Norwegian T5 language models, with 580M and 1.2B parameters, to create summaries. To assess the quality of the models, we employ both automatic ROUGE scores and human evaluations of the generated summaries. In an effort to better understand the models' behaviour, we measure how a model generates summaries with various metrics, including our own novel contribution, which we name "Match Ratio": it measures sentence similarity between summaries and articles based on Levenshtein distances.
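The abstract does not spell out the exact Match Ratio formula, but the core idea — the fraction of summary sentences that closely resemble some article sentence under a Levenshtein-based similarity — can be sketched as follows. The function names and the 0.8 matching threshold are illustrative assumptions, not the thesis's definition:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    # 1.0 means identical strings, 0.0 means maximally different.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def match_ratio(summary_sents, article_sents, threshold=0.8):
    # Fraction of summary sentences whose best Levenshtein-based
    # similarity to any article sentence reaches the threshold.
    # High values suggest copying (extractive behaviour), low values
    # suggest novel phrasing (abstractive behaviour).
    if not summary_sents:
        return 0.0
    matched = sum(
        1 for s in summary_sents
        if max(normalized_similarity(s, a) for a in article_sents) >= threshold
    )
    return matched / len(summary_sents)
```

A near-verbatim summary sentence drives the ratio toward 1.0, while freshly worded sentences drive it toward 0.0, which is what makes the metric useful for separating extractive from abstractive behaviour.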
The top-performing models achieved ROUGE-1 scores of 35.07 and 34.02 on SNL and CNN/DM, respectively. In terms of human evaluation, the best model yielded an average score of 3.96/5.00 for SNL and 4.64/5.00 for CNN/Daily Mail across various criteria. Based on these results, we conclude that it is possible to perform abstractive summarization of Norwegian text with high-quality summaries. With this research, we have laid a foundation that will hopefully facilitate future research, empowering others to build upon our findings and contribute further to the development of Norwegian summarization models.
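For context on what the reported scores measure: ROUGE-1 scores unigram overlap between a generated summary and a reference. A minimal sketch of ROUGE-1 F1 is below; production toolkits additionally apply stemming and tokenization rules, so this simplified version will not reproduce published numbers exactly:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    # ROUGE-1: unigram overlap, combining recall over reference tokens
    # and precision over candidate tokens into an F1 score.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A score of 35.07 thus corresponds to roughly a third of the unigrams (after the toolkit's normalization) being shared between generated and reference summaries.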
Live Blog Corpus for Summarization
Live blogs are an increasingly popular news format to cover breaking news and
live events in online journalism. Online news websites around the world are
using this medium to give their readers minute-by-minute updates on an event.
Good summaries enhance the value of the live blogs for a reader but are often
not available. In this paper, we study a way of collecting corpora for
automatic live blog summarization. In an empirical evaluation using well-known
state-of-the-art summarization systems, we show that the live blog corpus
poses new challenges in the field of summarization. We make our tools for
reconstructing the corpus publicly available, to encourage the research
community to replicate our results.
Comment: To appear in the Proceedings of LREC 201
Keyphrase Generation: A Multi-Aspect Survey
Extractive keyphrase generation research has been around since the nineties,
but the more advanced abstractive approach based on the encoder-decoder
framework and sequence-to-sequence learning has been explored only recently. In
fact, more than a dozen abstractive methods have been proposed in the last
three years, producing meaningful keyphrases and achieving state-of-the-art
scores. In this survey, we examine various aspects of the extractive keyphrase
generation methods and focus mostly on the more recent abstractive methods that
are based on neural networks. We pay particular attention to the mechanisms
that have driven the refinement of the latter. A large collection of scientific
article metadata and the corresponding keyphrases is created and released for
the research community. We also present various keyphrase generation and text
summarization research patterns and trends of the last two decades.
Comment: 10 pages, 5 tables. Published in the proceedings of FRUCT 2019, the 25th
Conference of the Open Innovations Association FRUCT, Helsinki, Finland
MReD: A Meta-Review Dataset for Structure-Controllable Text Generation
When directly using existing text generation datasets for controllable
generation, we face the problem of lacking domain knowledge, and thus
the aspects that can be controlled are limited. A typical example is
when using the CNN/Daily Mail dataset for controllable text summarization: there is
no guiding information on the emphasis of summary sentences. A more useful text
generator should leverage both the input text and the control signal to guide
the generation, which can only be built with a deep understanding of the domain
knowledge. Motivated by this vision, our paper introduces a new text generation
dataset, named MReD. Our new dataset consists of 7,089 meta-reviews, and all of
its 45k meta-review sentences are manually annotated with one of 9 carefully
defined categories, including abstract, strength, decision, etc. We present
experimental results on state-of-the-art summarization models, and propose
methods for structure-controlled generation with both extractive and
abstractive models using our annotated data. By exploring various settings and
analyzing the model behavior with respect to the control signal, we demonstrate
the challenges of our proposed task and the value of our dataset, MReD.
Meanwhile, MReD also allows us to have a better understanding of the
meta-review domain.
Comment: 15 pages, 5 figures, accepted at ACL 202
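A common recipe for the kind of structure-controlled generation described above is to prepend the desired sentence-category sequence as control tokens to the seq2seq encoder input. The sketch below illustrates that input construction only; the token syntax and separator are assumptions, and while "abstract", "strength", and "decision" are among MReD's 9 categories, the exact control format used in the paper is not specified here:

```python
# Three of MReD's 9 annotated sentence categories are named in the
# abstract; the token format and separator below are illustrative.
KNOWN_CATEGORIES = {"abstract", "strength", "decision"}

def build_controlled_input(source_text: str, control_plan: list) -> str:
    """Prepend a control sequence such as '<abstract> <decision>' so a
    seq2seq model can condition the structure of its generated
    meta-review on the requested category plan."""
    for label in control_plan:
        if label not in KNOWN_CATEGORIES:
            raise ValueError(f"unknown control label: {label}")
    prefix = " ".join(f"<{label}>" for label in control_plan)
    return f"{prefix} | {source_text}"
```

At training time, the control plan would be read off the gold annotations; at inference time, the user supplies it, which is what makes the output structure controllable.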