
    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the depth and breadth of computational semantic processing research can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing in the published version due to the publication policies. Please contact Prof. Erik Cambria for details.
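    The first of the surveyed tasks, word sense disambiguation, can be illustrated with a minimal sketch of the classic simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the target word's context. The tiny sense inventory below is invented for illustration; real systems draw glosses from a resource such as WordNet.

```python
def lesk(context_words, senses):
    """Simplified Lesk: senses maps sense id -> gloss string.
    Returns the sense id whose gloss overlaps the context most."""
    context = set(w.lower() for w in context_words)

    def overlap(gloss):
        return len(context & set(gloss.lower().split()))

    return max(senses, key=lambda s: overlap(senses[s]))


# Invented two-sense inventory for the ambiguous word "bank".
SENSES_BANK = {
    "bank.finance": "an institution that accepts deposits and lends money",
    "bank.river": "the sloping land alongside a river or stream",
}

sentence = "she sat on the bank of the river and watched the stream".split()
best = lesk(sentence, SENSES_BANK)  # "river" and "stream" favour bank.river
```

The gloss-overlap heuristic is crude, but it shows the shape of the task: disambiguation reduces to scoring candidate senses against the surrounding context.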

    Guiding Abstractive Summarization using Structural Information

    Abstractive summarization takes a set of sentences from a source document and reproduces its salient information in a summary using the summarizer's own words. Produced summaries may contain novel words and have different grammatical structures from the source document. In a sense, abstractive summarization is closer to how a human summarizes, yet it is also more difficult to automate since it requires a full understanding of natural language. However, with the inception of deep learning, many new summarization systems have achieved improved automatic and manual evaluation scores. One prominent deep learning model is the sequence-to-sequence model with an attention-based mechanism. Moreover, the advent of language models pre-trained over huge sets of unlabeled data further improved the performance of summarization systems. Despite all these improvements, abstractive summarization is still adversely affected by hallucination and disfluency. Furthermore, all these recent works that use a seq2seq model require a large dataset, since the underlying neural network easily overfits on a small dataset, resulting in a poor approximation and high-variance outputs. The problem is that these large datasets often come with only a single reference summary for each source document, even though human annotators are known to be subject to a certain degree of subjectivity when writing a summary. We address the first problem by using a mechanism where the model uses a guidance signal to control which tokens are to be generated. A guidance signal is any additional signal fed into the model alongside the source document; a commonly used one is structural information from the source document.
Recent approaches showed good results with this mechanism; however, they used a joint-training approach for the guiding mechanism, meaning the model needs to be re-trained whenever a different guidance signal is used, which is costly. We propose approaches that work without re-training and are therefore more flexible with regard to the guidance signal source, as well as computationally cheaper. We performed two different experiments. The first is a novel guided mechanism that extends previous work on abstractive summarization using Abstract Meaning Representation (AMR) with a neural language generation stage, which we guide using side information. Results showed that our approach improves over a strong baseline by 2 ROUGE-2 points. The second experiment is a guided key-phrase extractor for more informative summarization. This experiment showed mixed results, but we provide an analysis of the negative and positive output examples. The second problem is addressed by our proposed manual evaluation framework, Highlight-based Reference-less Evaluation of Summarization (HighRES). The proposed framework avoids reference bias and provides absolute instead of ranked evaluation of the systems. To validate our approach, we employed crowd-workers to add highlight annotations to the eXtreme SUMmarization (XSUM) dataset, a highly abstractive summarization dataset. We then compared two abstractive systems (Pointer Generator and T-Conv) to demonstrate our approach. Results showed that HighRES improves inter-annotator agreement in comparison to using the source document directly, while it also emphasizes differences among systems that would be ignored under other evaluation approaches. Our work also produces an annotated dataset which gives more understanding of how humans select salient information from the source document.
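The idea of guiding generation without re-training can be sketched as a decoding-time adjustment: the base model's next-token scores are left untouched by training, and tokens appearing in the guidance signal simply receive a bonus before sampling. This is a toy illustration of the general mechanism, not the thesis's actual models; the logit values below are invented.

```python
import math


def guide_next_token(logits, guidance_tokens, boost=2.0):
    """Add a constant bonus to the logit of every token present in the
    guidance signal, then renormalise with a softmax. Because only the
    decoding step changes, no re-training is needed to swap the signal."""
    adjusted = {
        tok: score + (boost if tok in guidance_tokens else 0.0)
        for tok, score in logits.items()
    }
    z = sum(math.exp(s) for s in adjusted.values())
    return {tok: math.exp(s) / z for tok, s in adjusted.items()}


# Invented next-token scores from a hypothetical summariser.
logits = {"economy": 1.0, "minister": 1.2, "said": 1.5}

# A guidance signal (e.g. key phrases from the source) steers decoding.
probs = guide_next_token(logits, guidance_tokens={"economy"})
best = max(probs, key=probs.get)
```

Swapping in a different guidance signal here is just a different set argument, which is exactly the flexibility a joint-training approach lacks.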

    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    Peer reviewed

    Representing relational knowledge with language models

    Relational knowledge is the ability to recognize the relationship between instances, and it plays an important role in how humans understand concepts and perform commonsense reasoning. We humans structure our knowledge by understanding individual instances together with the relationships among them, which enables us to further expand that knowledge. Nevertheless, modelling relational knowledge with computational models is a long-standing challenge in Natural Language Processing (NLP). The main difficulty in acquiring relational knowledge arises from the need for generalization. For pre-trained Language Models (LMs), in spite of their huge impact on NLP, relational knowledge remains understudied. In fact, GPT-3 (Brown et al., 2020), one of the largest LMs at the time with 175 billion parameters, has been shown to perform worse than a traditional statistical baseline on an analogy benchmark. Our initial results hinted at the type of relational knowledge encoded in some of the LMs. However, we found that such knowledge can hardly be extracted without a carefully designed method tuned on a task-specific validation set. Motivated by this finding, we propose a method (RelBERT) for distilling relational knowledge via LM fine-tuning. This method successfully retrieves flexible relation embeddings that achieve State-of-The-Art (SoTA) results on various analogy benchmarks. Moreover, it exhibits high generalization ability, handling relation types that are not included in the training data. Finally, we propose a new task of modelling graded relations between named entities, which reveals some limitations of recent SoTA LMs as well as RelBERT, suggesting future research directions for modelling relational knowledge in the current LM era, especially when it comes to named entities.
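    The analogy setting described above can be sketched with toy relation embeddings: embed each word pair as a vector, then pick the candidate whose pair embedding is most similar to the query pair's. Here the "relation embedding" is just the offset between hand-crafted 2-d word vectors, a crude stand-in for the learned pair embeddings a model like RelBERT produces; all vectors below are invented.

```python
def sub(a, b):
    """Vector offset: a toy relation embedding for the pair (a, b)."""
    return tuple(x - y for x, y in zip(a, b))


def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


# Invented vectors: axis 0 ~ "country-ness", axis 1 ~ "capital-ness".
VEC = {
    "france": (1.0, 0.0), "paris": (1.0, 1.0),
    "japan": (0.9, 0.1), "tokyo": (0.9, 1.1),
    "apple": (0.0, 0.2),
}

# Solve "paris is to france as ? is to japan" by comparing pair embeddings.
query = sub(VEC["paris"], VEC["france"])
candidates = ["tokyo", "apple"]
best = max(candidates, key=lambda w: cos(query, sub(VEC[w], VEC["japan"])))
```

The point of a learned relation encoder is precisely to replace this brittle vector-offset trick with embeddings that generalize to unseen relation types.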

    Semantics-driven Abstractive Document Summarization

    The evolution of the Web over the last three decades has led to a deluge of scientific and news articles on the Internet. Harnessing these publications in different fields of study is critical to effective end-user information consumption. Similarly, in the domain of healthcare, one of the key challenges with the adoption of Electronic Health Records (EHRs) for clinical practice has been the tremendous amount of clinical notes generated; without summarization of these notes, clinical decision making and communication are inefficient and costly. In spite of the rapid advances in information retrieval and deep learning techniques towards abstractive document summarization, the results of these efforts continue to resemble extractive summaries, achieving promising results predominantly on lexical metrics but performing poorly on semantic metrics. Thus, abstractive summarization that is driven by the intrinsic and extrinsic semantics of documents is not adequately explored. Resources that can be used for generating semantics-driven abstractive summaries include:
    • Abstracts of multiple scientific articles published in a given technical field of study, to generate an abstractive summary for topically-related abstracts within the field, thus reducing the load of having to read semantically duplicate abstracts on a given topic.
    • Citation contexts from different authoritative papers citing a reference paper, which can be used to generate a utility-oriented abstractive summary for a scientific article.
    • Biomedical articles and the named entities characterizing them, along with background knowledge bases, to generate entity- and fact-aware abstractive summaries.
    • Clinical notes of patients and clinical knowledge bases, for abstractive clinical text summarization using knowledge-driven multi-objective optimization.
In this dissertation, we develop semantics-driven abstractive models based on intra-document and inter-document semantic analyses, along with facts about named entities retrieved from domain-specific knowledge bases, to produce summaries. Concretely, we propose a sequence of frameworks leveraging semantics at various levels of granularity (e.g., word, sentence, document, topic, citations, and named entities) by utilizing external resources. The proposed frameworks have been applied to a range of tasks, including:
1. Abstractive summarization of topic-centric multi-document scientific articles and news articles.
2. Abstractive summarization of scientific articles using crowd-sourced citation contexts.
3. Abstractive summarization of biomedical articles clustered based on entity-relatedness.
4. Abstractive summarization of clinical notes of patients with heart failure and chest X-ray recordings.
The proposed approaches achieve impressive performance in terms of preserving semantics in abstractive summarization while paraphrasing. For summarization of topic-centric multiple scientific/news articles, we propose a three-stage approach where abstracts of scientific articles or news articles are clustered based on their topical similarity, determined from topics generated using Latent Dirichlet Allocation (LDA), followed by an extractive phase and an abstractive phase. Then, in the next stage, we focus on abstractive summarization of biomedical literature, where we leverage named entities in biomedical articles to 1) cluster related articles and 2) guide abstractive summarization.
Finally, in the last stage, we turn to external resources: citation contexts pointing to a scientific article to generate a comprehensive and utility-centric abstractive summary of that article, domain-specific knowledge bases to fill gaps in information about entities in a biomedical article being summarized, and clinical notes to guide abstractive summarization of clinical text. Thus, the bottom-up progression of exploring semantics towards abstractive summarization in this dissertation starts with (i) Semantic Analysis of Latent Topics; builds on (ii) Internal and External Knowledge-I (gleaned from abstracts and Citation Contexts); and extends it to make it comprehensive using (iii) Internal and External Knowledge-II (Named Entities and Knowledge Bases).
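The first stage of the three-stage pipeline, clustering articles by topical similarity, can be sketched as grouping documents by their dominant topic. The per-document topic distributions would come from an LDA model; the distributions below are invented for illustration.

```python
from collections import defaultdict


def cluster_by_dominant_topic(doc_topics):
    """doc_topics maps doc_id -> [p(topic_0), p(topic_1), ...].
    Groups documents by the topic with the highest probability."""
    clusters = defaultdict(list)
    for doc, dist in doc_topics.items():
        dominant = max(range(len(dist)), key=dist.__getitem__)
        clusters[dominant].append(doc)
    return dict(clusters)


# Invented topic distributions standing in for LDA output.
doc_topics = {
    "abs1": [0.8, 0.1, 0.1],  # mostly topic 0
    "abs2": [0.7, 0.2, 0.1],  # mostly topic 0
    "abs3": [0.1, 0.1, 0.8],  # mostly topic 2
}
clusters = cluster_by_dominant_topic(doc_topics)
```

Each resulting cluster of topically-related abstracts then feeds the extractive and abstractive phases described above.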

    Pretrained Transformers for Text Ranking: BERT and Beyond

    The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.
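    The multi-stage architecture described above can be sketched minimally: a cheap first-stage retriever narrows the corpus to a shortlist, and a more expensive reranker reorders that shortlist. The two scoring functions below are toy stand-ins for, respectively, a lexical retriever like BM25 and a transformer cross-encoder; the corpus is invented.

```python
def first_stage(query, corpus, k=2):
    """Cheap retrieval: rank by raw query-term overlap, keep the top k."""
    q = set(query.split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.split())),
                    reverse=True)
    return ranked[:k]


def rerank(query, shortlist):
    """'Expensive' reranker applied only to the shortlist: term overlap
    normalised by document length, standing in for a cross-encoder
    relevance score computed over the full (query, document) pair."""
    q = set(query.split())
    return sorted(shortlist,
                  key=lambda d: len(q & set(d.split())) / len(d.split()),
                  reverse=True)


corpus = [
    "neural networks for text ranking",
    "text ranking with pretrained transformers and more words padding it out",
    "cooking recipes for beginners",
]
ranked = rerank("text ranking", first_stage("text ranking", corpus))
```

The design point is the effectiveness/efficiency tradeoff from the survey: the costly scorer only ever sees the k candidates the cheap one lets through.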

    Natural Language Processing: Emerging Neural Approaches and Applications

    This Special Issue highlights the most recent research being carried out in the NLP field and discusses related open issues, with a particular focus both on emerging approaches for language learning, understanding, production, and grounding (interactively or autonomously from data, in cognitive and neural systems) and on their potential or real applications in different domains.