The role of statistical and semantic features in single-document extractive summarization
This paper reports further results of ongoing research analyzing the impact of a range of commonly used statistical and semantic features in the context of extractive text summarization. The features experimented with include word frequency, inverse sentence and term frequencies, stopword filtering, word senses, resolved anaphora and textual entailment. The results demonstrate the relative importance of each feature and the limitations of the tools available. Inverse sentence frequency combined with term frequency yields almost the same results as term frequency combined with stopword filtering, which in turn proved to be a highly competitive baseline. To improve the suboptimal results of anaphora resolution, the system was extended with a second anaphora resolution module. The paper also describes the first attempts at internal document data representation.
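As an illustration of the term frequency and inverse sentence frequency features named in the abstract above, the following is a minimal sketch of TF-ISF sentence scoring. It is not the paper's system: the tokenizer, the tiny stopword list, the length normalization, and the function names `tokenize`, `tf_isf_scores`, and `summarize` are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "on"}


def tokenize(sentence):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def tf_isf_scores(sentences):
    """Score each sentence by the mean TF-ISF weight of its tokens.

    TF is the word's count in the whole document; ISF is log(N / sf),
    where N is the number of sentences and sf is the number of sentences
    containing the word.
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    tf = Counter(w for toks in tokenized for w in toks)
    sf = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        total = sum(tf[w] * math.log(n / sf[w]) for w in toks)
        scores.append(total / len(toks))  # normalize by sentence length
    return scores


def summarize(sentences, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = tf_isf_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A word that occurs in every sentence gets an ISF of zero under this weighting, so it contributes nothing to any sentence score, which is roughly the effect stopword filtering aims for.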
Building Legal Case Retrieval Systems with Lexical Matching and Summarization using A Pre-Trained Phrase Scoring Model
We present our method for tackling the legal case retrieval task of the
Competition on Legal Information Extraction/Entailment 2019. Our approach is
based on the idea that summarization is important for retrieval. On one hand,
we adopt a summarization based model called encoded summarization which encodes
a given document into continuous vector space which embeds the summary
properties of the document. We utilize the resource of COLIEE 2018 on which we
train the document representation model. On the other hand, we extract lexical
features on different parts of a given query and its candidates. We observe
that by comparing different parts of the query and its candidates, we can
achieve better performance. Furthermore, the combination of the lexical
features with latent features by the summarization-based method achieves even
better performance. We have achieved the state-of-the-art result for the task
on the benchmark of the competition.
Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles
Recent advances in natural language processing have enabled automation of a
wide range of tasks, including machine translation, named entity recognition,
and sentiment analysis. Automated summarization of documents, or groups of
documents, however, has remained elusive, with many efforts limited to
extraction of keywords, key phrases, or key sentences. Accurate abstractive
summarization has yet to be achieved due to the inherent difficulty of the
problem, and limited availability of training data. In this paper, we propose a
topic-centric unsupervised multi-document summarization framework to generate
extractive and abstractive summaries for groups of scientific articles across
20 Fields of Study (FoS) in Microsoft Academic Graph (MAG) and news articles
from DUC-2004 Task 2. The proposed algorithm generates an abstractive summary
by developing salient language unit selection and text generation techniques.
Our approach matches the state-of-the-art when evaluated on automated
extractive evaluation metrics and performs better for abstractive summarization
on five human evaluation metrics (entailment, coherence, conciseness,
readability, and grammar). We achieve a kappa score of 0.68 between two
co-author linguists who evaluated our results. We plan to publicly share
MAG-20, a human-validated gold standard dataset of topic-clustered research
articles and their summaries to promote research in abstractive summarization.
Comment: 6 pages, 6 Figures, 8 Tables. Accepted at IEEE Big Data 2020
(https://bigdataieee.org/BigData2020/AcceptedPapers.html)
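The abstract above reports a kappa score between two evaluators without naming the variant. Assuming it is Cohen's kappa, which is the usual choice for two raters, agreement could be computed roughly as follows. The function name `cohens_kappa` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    label distribution.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical binary judgements (1 = acceptable summary, 0 = not acceptable).
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy example the raters agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5.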
Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks
We study existing approaches to leverage off-the-shelf Natural Language
Inference (NLI) models for the evaluation of summary faithfulness and argue
that these are sub-optimal due to the granularity level considered for premises
and hypotheses. That is, the smaller content unit considered as hypothesis is a
sentence and premises are made up of a fixed number of document sentences. We
propose a novel approach, namely InFusE, that uses a variable premise size and
simplifies summary sentences into shorter hypotheses. Departing from previous
studies which focus on single short document summarisation, we analyse NLI
based faithfulness evaluation for diverse summarisation tasks. We introduce
DiverSumm, a new benchmark comprising long form summarisation (long documents
and summaries) and diverse summarisation tasks (e.g., meeting and
multi-document summarisation). In experiments, InFusE obtains superior
performance across the different summarisation tasks. Our code and data are
available at https://github.com/HJZnlp/infuse.
Comment: EACL 202
COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres
In this paper, we present a text summarisation tool, COMPENDIUM, capable of generating the most common types of summaries. Regarding the input, single- and multi-document summaries can be produced; as for the output, the summaries can be extractive or abstractive-oriented; and finally, concerning their purpose, the summaries can be generic, query-focused, or sentiment-based. The proposed architecture for COMPENDIUM is divided into various stages, making a distinction between core and additional stages. The former constitute the backbone of the tool and are common to the generation of any type of summary, whereas the latter are used for enhancing the capabilities of the tool. The main contributions of COMPENDIUM with respect to state-of-the-art summarisation systems are that (i) it specifically deals with the problem of redundancy by means of textual entailment; (ii) it combines statistical and cognitive-based techniques for determining relevant content; and (iii) it proposes an abstractive-oriented approach to the challenge of abstractive summarisation. The evaluation, performed in different domains and textual genres comprising traditional texts as well as texts extracted from the Web 2.0, shows that COMPENDIUM is very competitive and appropriate to be used as a tool for generating summaries.
This research has been supported by the project “Desarrollo de Técnicas Inteligentes e Interactivas de Minería de Textos” (PROMETEO/2009/119) and the project reference ACOMP/2011/001 from the Valencian Government, as well as by the Spanish Government (grant no. TIN2009-13391-C04-01).