Sentence Extraction Based on Sentence Distribution and Part of Speech Tagging for Multi-Document Summarization
Automatic multi-document summarization must find representative sentences not only from how sentences are distributed across the corpus but also from how informative the terms in each sentence are. Sentence distribution is suitable for identifying important sentences, since it favors frequent and well-spread words in the corpus, but it ignores the grammatical information that signals informative content. Such grammatical information is carried by part-of-speech (POS) labels. In this paper, we propose a new sentence weighting method that incorporates both sentence distribution and POS tagging for multi-document summarization. Similarity-based Histogram Clustering (SHC) is used to cluster the sentences in the data set, and clusters are ordered by cluster importance. Sentence extraction based on sentence distribution and POS tagging is then applied to extract representative sentences from the ordered clusters. Experimental results on the Document Understanding Conference (DUC) 2004 data set are compared with those of the Sentence Distribution Method: our proposed method achieves better results, improving ROUGE-1 by 5.41% and ROUGE-2 by 0.62%.
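To make the combined weighting idea concrete, the following is a minimal sketch, not the paper's exact formulation: the mixing weight alpha, the choice of "informative" POS tags (nouns, verbs, adjectives), and the frequency-based distribution proxy are all illustrative assumptions.

```python
# Hypothetical sketch: blend a distribution score (how frequent and
# well-spread a sentence's terms are in the corpus) with a POS-based
# informativeness score. alpha and INFORMATIVE_TAGS are assumptions.
# Requires: pip install nltk, plus the "punkt" and
# "averaged_perceptron_tagger" NLTK data packages.
from collections import Counter

from nltk import pos_tag, word_tokenize

INFORMATIVE_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG",
                    "VBN", "VBP", "VBZ", "JJ"}  # nouns, verbs, adjectives

def sentence_scores(sentences, alpha=0.5):
    tokenized = [word_tokenize(s) for s in sentences]
    term_freq = Counter(t.lower() for toks in tokenized for t in toks)
    total = sum(term_freq.values())
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)
            continue
        # Average corpus frequency of the sentence's terms.
        dist = sum(term_freq[t.lower()] / total for t in toks) / len(toks)
        # Fraction of tokens carrying an informative POS label.
        tags = [tag for _, tag in pos_tag(toks)]
        pos_score = sum(tag in INFORMATIVE_TAGS for tag in tags) / len(tags)
        scores.append(alpha * dist + (1 - alpha) * pos_score)
    return scores
```

Sentences with the highest combined scores would then be extracted from each ordered cluster.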
Implicit Entity Networks: A Versatile Document Model
The time in which we live is often referred to as the Information Age. However, it can also aptly be characterized as an age of constant information overload. Nowhere is this more evident than on the Web, which serves as an endless source of news articles, blog posts, and social media messages. Of course, this overload is even greater for professionals who handle the creation or extraction of information and knowledge, such as journalists, lawyers, researchers, clerks, or medical practitioners. The volume of available documents and the interconnectedness of their contents are both a blessing and a curse for the contemporary information consumer. On the one hand, they provide near limitless information, but on the other hand, their consumption and comprehension require an amount of time that many of us cannot spare. As a result, automated extraction, aggregation, and summarization techniques have risen in popularity, even though they are a long way from being comprehensive. When we, as humans, are faced with an overload of information, we tend to look for patterns that bring order into the chaos. In news, we might identify familiar political figures or celebrities, whereas we might look for expressive symptoms in medicine, or precedential cases in law. In other words, we look for known entities as reference points, and then explore the content along the lines of their relations to other entities. Unfortunately, this approach is not reflected in current document models, which do not provide a similar focus on entities. As a direct result, the retrieval of entity-centric knowledge and relations from a flood of textual information becomes more difficult than it has to be, and the inclusion of external knowledge sources is impeded.
In this thesis, we introduce implicit entity networks as a comprehensive document model that addresses this shortcoming and provides a holistic representation of document collections and document streams. Based on the premise of modelling the co-occurrence relations between terms and entities as first-class citizens, we investigate how the resulting network structure facilitates efficient and effective entity-centric search, and demonstrate the extraction of complex entity relations, as well as their summarization. We show that the implicit network model is fully compatible with dynamic streams of documents. Furthermore, we introduce document aggregation methods that are sensitive to the context of entity mentions and can be used to distinguish between different entity relations. Beyond the relations of individual entities, we introduce network topics as a novel and scalable method for the extraction of topics from collections and streams of documents. Finally, we combine the insights gained from these applications in a versatile hypergraph document model that bridges the gap between unstructured text and structured knowledge sources.
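As a rough illustration of the underlying idea (a sketch, not the thesis's implementation), such a network can be built by treating terms and entities alike as nodes and weighting edges by co-occurrence; the sentence-level co-occurrence window and the extract_entities callable are assumptions for the example.

```python
# Illustrative sketch: a weighted co-occurrence network in which terms
# and entities are both first-class nodes. Requires: pip install networkx
from itertools import combinations

import networkx as nx

def build_implicit_network(sentences, extract_entities):
    """sentences: list of token lists; extract_entities: assumed callable
    mapping a token list to the set of entity mentions it contains."""
    g = nx.Graph()
    for tokens in sentences:
        entities = set(extract_entities(tokens))
        for tok in set(tokens):
            g.add_node(tok, kind="entity" if tok in entities else "term")
        for u, v in combinations(set(tokens), 2):
            if g.has_edge(u, v):
                g[u][v]["weight"] += 1  # one more shared sentence
            else:
                g.add_edge(u, v, weight=1)
    return g

def related(g, node, top_k=10):
    """Entity-centric lookup: neighbours ranked by co-occurrence weight."""
    nbrs = g[node]
    return sorted(nbrs, key=lambda n: nbrs[n]["weight"], reverse=True)[:top_k]
```

Entity-centric search then reduces to graph queries such as ranking an entity's neighbours, which is what makes the representation attractive for document streams.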
Building Intelligent and Reliable Summarization Systems
Data, in various formats, surrounds us everywhere in our daily lives, from education to entertainment and media. In the era of big data, the amount of textual data on the web has grown exponentially over the past decade. This leads to the problem of information overload, where an individual is exposed to more information than they can process. Hence the need for automatic text summarization (ATS) systems, which can automatically distill this vast raw information into key points in the form of smaller, digestible pieces. ATS systems operate by extracting or generating a concise and readable summary while preserving the salient information of the original documents. Developing intelligent systems that produce concise, fluent, and reliable summaries has been a long-standing goal in natural language processing (NLP). Significant progress has been made in recent years, thanks to breakthroughs such as the pre-trained language models BERT and GPT. However, text summarization remains a complex and multifaceted task. Like the cognitive process humans follow when crafting summaries, text summarization requires a machine to first semantically understand the contents of a document, then identify and extract its salient information, and finally generate an accurate and faithful summary.

This dissertation presents several distinct approaches to tackle these three critical steps of building ATS systems. First, I present my work on improving the modeling of long documents for extractive summarization. I introduce HEGEL, a hypergraph neural network for long-document summarization that captures high-order cross-sentence relations. HEGEL learns effective sentence representations with hypergraph transformer layers and fuses different types of sentence dependencies, including latent topics, keywords, coreference, and section structure. Extensive experiments on two benchmark datasets demonstrate the effectiveness and efficiency of HEGEL for long-document modeling and extractive summarization.

I then turn to the holistic extraction of salient information from documents. To address the limitation of per-sentence label prediction in existing extractive summarization systems, I propose DiffuSum, a novel paradigm for extractive summarization that directly generates the desired summary sentence representations with diffusion models and extracts sentences by sentence-representation matching. DiffuSum jointly optimizes a contrastive sentence encoder with a matching loss for sentence-representation alignment and a multi-class contrastive loss for representation diversity. I also introduce a holistic framework for unsupervised multi-document extractive summarization. The method pairs a holistic beam-search inference procedure with a holistic measurement, the Subset Representative Index (SRI), which balances the importance and diversity of a subset of sentences from the source documents and can be computed in an unsupervised and adaptive manner.

Next, I demonstrate my work on improving the quality and faithfulness of generated summaries. Although text summarization systems have made significant progress in recent years, they typically generate summaries in a single step. This one-shot setting is sometimes inadequate: the generated summary may contain hallucinations or overlook essential details related to the reader's interests. To address this, I propose SummIt, an iterative text summarization framework based on large language models (LLMs) such as ChatGPT. SummIt lets the model refine the generated summary iteratively through self-evaluation and feedback, resembling the iterative process humans follow when drafting and revising summaries. I further explore the benefits of integrating knowledge and topic extractors into the framework to enhance summary faithfulness and controllability. Automatic evaluation and human studies on three benchmark summarization datasets validate the effectiveness of the iterative refinements and identify potential issues of over-correction.

Finally, as the emergence of large language models reshapes NLP research, I present a thorough evaluation of ChatGPT's performance on extractive summarization and compare it with traditional fine-tuning methods on various benchmark datasets. The experimental analysis reveals that ChatGPT exhibits inferior extractive summarization performance in terms of ROUGE scores compared to existing supervised systems, while achieving higher performance under LLM-based evaluation metrics. I also explore the effectiveness of in-context learning and chain-of-thought reasoning for enhancing its performance, and propose an extract-then-generate pipeline with ChatGPT that yields significant improvements over abstractive baselines in terms of summary faithfulness. These observations highlight potential directions for enhancing ChatGPT's capabilities in faithful summarization using two-stage approaches.

In summary, by demonstrating and examining these systems and solutions, I aim to highlight the three critical yet challenging steps in building intelligent and reliable summarization systems, which are also crucial steps towards the design of a more powerful and trustworthy AI assistant. I hope future research will continue to advance along these directions.
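The refine-by-feedback loop at the heart of a SummIt-style system can be sketched in a few lines; this is a hedged illustration, with the llm callable, the round budget, and the prompt wording all assumed rather than taken from the dissertation.

```python
# Hypothetical sketch of iterative summary refinement via LLM
# self-evaluation: draft, critique, revise, until the critique is clean
# or a step budget runs out. `llm` is an assumed str -> str callable
# wrapping any chat model.
def iterative_summarize(document, llm, max_rounds=3):
    summary = llm(f"Summarize the following document:\n\n{document}")
    for _ in range(max_rounds):
        feedback = llm(
            "Evaluate the summary below against the document for "
            "faithfulness and missing key details. Reply 'OK' if there "
            f"are no issues.\n\nDocument:\n{document}\n\nSummary:\n{summary}"
        )
        if feedback.strip().upper().startswith("OK"):
            break  # self-evaluation found nothing left to fix
        summary = llm(
            "Revise the summary to address the feedback.\n\n"
            f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
            f"Feedback:\n{feedback}"
        )
    return summary
```

Capping the number of rounds matters in practice: the dissertation's human studies flag over-correction as a failure mode of unbounded refinement.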
Extractive Summarization via ChatGPT for Faithful Summary Generation
Extractive summarization is a crucial task in natural language processing
that aims to condense long documents into shorter versions by directly
extracting sentences. The recent introduction of ChatGPT has attracted
significant interest in the NLP community due to its remarkable performance on
a wide range of downstream tasks. However, concerns regarding factuality and
faithfulness have hindered its practical applications for summarization
systems. This paper first presents a thorough evaluation of ChatGPT's
performance on extractive summarization and compares it with traditional
fine-tuning methods on various benchmark datasets. Our experimental analysis
reveals that ChatGPT's extractive summarization performance is still inferior
to existing supervised systems in terms of ROUGE scores. In addition, we
explore the effectiveness of in-context learning and chain-of-thought reasoning
for enhancing its performance. Furthermore, we find that applying an
extract-then-generate pipeline with ChatGPT yields significant performance
improvements over abstractive baselines in terms of summary faithfulness. These
observations highlight potential directions for enhancing ChatGPT's
capabilities for faithful text summarization tasks using two-stage approaches.
Comment: Work in progress.
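The extract-then-generate pipeline the paper evaluates is a two-call prompting pattern; a minimal sketch using the openai Python client follows, in which the model name, the number of extracted sentences, and the prompt wording are assumptions, not the paper's exact setup.

```python
# Sketch of extract-then-generate: first ask the model to copy salient
# sentences verbatim, then generate an abstractive summary grounded only
# in those sentences. Requires: pip install openai and an OPENAI_API_KEY;
# the model name and prompts are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def chat(prompt, model="gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def extract_then_generate(document):
    extracted = chat(
        "Copy, verbatim, the 3 most important sentences from this "
        f"document, one per line:\n\n{document}"
    )
    return chat(
        "Write a concise summary using ONLY the facts in these "
        f"sentences:\n\n{extracted}"
    )
```

Grounding the second call in extracted sentences rather than the full document is what the paper credits for the faithfulness gains over one-shot abstractive baselines.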
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyze five semantic processing tasks, namely word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.
Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing from the published version due to the publication policies. Please contact Prof. Erik Cambria for details.
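Of the five surveyed tasks, named entity recognition is the quickest to demonstrate; a minimal example with spaCy follows (the tooling choice is an assumption for illustration, not one prescribed by the survey).

```python
# Minimal named entity recognition example, one of the five semantic
# processing tasks the survey covers. Requires:
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Erik Cambria published the survey in Information Fusion in 2024.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Erik Cambria" -> PERSON
```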