
    Deep Learning for Text Style Transfer: A Survey

    Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing, and has recently regained significant attention thanks to the promising performance brought by deep neural models. In this paper, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of this task. Our curated paper list is at https://github.com/zhijing-jin/Text_Style_Transfer_Survey
    Comment: Computational Linguistics Journal 2022
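
As a rough illustration of the task formulation the survey covers, the sketch below rewrites a sentence toward a target attribute with a sequence-to-sequence model conditioned on a task prefix. The checkpoint name is a hypothetical placeholder, not a model from the survey; any seq2seq model fine-tuned on style-paired data would slot in.

```python
# Minimal sketch of attribute-controlled rewriting with a seq2seq model.
# "style-transfer-model" is a HYPOTHETICAL fine-tuned checkpoint; substitute
# any model trained on (source-style, target-style) sentence pairs.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "style-transfer-model"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def transfer(text: str, target_style: str) -> str:
    # Condition generation on the target attribute via a T5-style task prefix.
    prompt = f"transfer to {target_style}: {text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(transfer("gotta bounce, this meeting is dragging", "polite"))
```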

    Exploration and adaptation of large language models for specialized domains

    Large language models have transformed the field of natural language processing (NLP). Their improved performance on various NLP benchmarks makes them a promising tool, including for application in specialized domains. Such domains are characterized by highly trained professionals with particular domain expertise. Since these experts are rare, improving the efficiency of their work with automated systems is especially desirable. However, domain-specific text resources hold various challenges for NLP systems, including distinct language, noisy and scarce data, and a high level of variation. Further, specialized domains present an increased need for transparent systems, since they are often applied in high-stakes settings. In this dissertation, we examine whether large language models (LLMs) can overcome some of these challenges and propose methods to effectively adapt them to domain-specific requirements. We first investigate the inner workings and abilities of LLMs and show how they can fill the gaps present in previous NLP algorithms for specialized domains. To this end, we explore the sources of errors produced by earlier systems to identify which of them can be addressed by using LLMs. Following this, we take a closer look at how information is processed within Transformer-based LLMs to better understand their capabilities. We find that their layers encode different dimensions of the input text; the contextual vector representations and the general language knowledge learned during pre-training are especially beneficial for solving complex and multi-step tasks common in specialized domains. Following this exploration, we propose solutions for further adapting LLMs to the requirements of domain-specific tasks. We focus on the clinical domain, which incorporates many typical challenges found in specialized domains. We show how to improve generalization by integrating different domain-specific resources into our models. We further analyze the behavior of the produced models and propose a behavioral testing framework that can serve as a tool for communication with domain experts. Finally, we present an approach for incorporating the benefits of LLMs while fulfilling requirements such as interpretability and modularity. The presented solutions show improvements in performance on benchmark datasets and in manually conducted analyses with medical professionals. Our work provides both new insights into the inner workings of pre-trained language models and multiple adaptation methods, showing that LLMs can be an effective tool for NLP in specialized domains.
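
The layer-wise analysis the abstract alludes to can be made concrete with a short probing sketch. The example below assumes a generic bert-base-uncased encoder rather than any model used in the dissertation: it extracts one pooled vector per layer, and a probing classifier trained on such per-layer vectors is the standard way to test what each depth encodes.

```python
# Sketch: extract one vector per Transformer layer to study how the input
# is encoded at different depths. Model choice and mean-pooling are
# illustrative assumptions, not the dissertation's exact setup.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

text = "The patient was started on 5 mg of warfarin daily."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states: tuple of (num_layers + 1) tensors, each of shape
# (1, seq_len, hidden_dim); index 0 is the embedding layer.
for layer, states in enumerate(outputs.hidden_states):
    vector = states.mean(dim=1).squeeze(0)  # mean-pool over tokens
    print(f"layer {layer:2d}: pooled vector norm = {vector.norm().item():.2f}")
# A probing classifier trained on these per-layer vectors can then test
# which layers encode which dimension of the input (syntax, domain terms, ...).
```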

    Sentiment Analysis for Fake News Detection

    In recent years, we have witnessed a rise in fake news, i.e., provably false pieces of information created with the intention of deception. The dissemination of this type of news poses a serious threat to social cohesion and well-being, since it fosters political polarization and distrust of people with respect to their leaders. The huge amount of news disseminated through social media makes manual verification unfeasible, which has promoted the design and implementation of automatic systems for fake news detection. The creators of fake news use various stylistic tricks to promote the success of their creations, one of them being to excite the sentiments of the recipients. This has led sentiment analysis, the part of text analytics in charge of determining the polarity and strength of sentiments expressed in a text, to be used in fake news detection approaches, either as a basis of the system or as a complementary element. In this article, we study the different uses of sentiment analysis in the detection of fake news, with a discussion of the most relevant elements and shortcomings, and the requirements that should be met in the near future, such as multilingualism, explainability, mitigation of biases, or treatment of multimedia elements.

    Xunta de Galicia; ED431G 2019/01
    Xunta de Galicia; ED431C 2020/11
    This work has been funded by FEDER/Ministerio de Ciencia, Innovación y Universidades — Agencia Estatal de Investigación through the ANSWERASAP project (TIN2017-85160-C2-1-R); and by Xunta de Galicia through a Competitive Reference Group grant (ED431C 2020/11). CITIC, as a Research Center of the Galician University System, is funded by the Consellería de Educación, Universidade e Formación Profesional of the Xunta de Galicia through the European Regional Development Fund (ERDF/FEDER), with 80% from the Galicia ERDF 2014-20 Operational Programme and the remaining 20% from the Secretaría Xeral de Universidades (ref. ED431G 2019/01). David Vilares is also supported by a 2020 Leonardo Grant for Researchers and Cultural Creators from the BBVA Foundation. Carlos Gómez-Rodríguez has also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant No. 714150).
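
To make the "sentiment as a complementary element" pattern from the abstract concrete, here is a minimal sketch that appends VADER polarity and strength scores to TF-IDF features before a linear classifier. The four labeled texts are invented toy examples, not data from the article; a real system would train on a fake news corpus.

```python
# Sketch: sentiment scores as complementary features for fake news detection.
# The texts and labels below are TOY PLACEHOLDERS.
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# One-time setup: import nltk; nltk.download("vader_lexicon")
texts = [
    "SHOCKING!!! You won't BELIEVE what they are hiding from you!",
    "The ministry published its quarterly budget report on Tuesday.",
    "OUTRAGE: miracle cure BANNED by corrupt elites!!!",
    "Researchers report a modest decline in regional unemployment.",
]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = legitimate (toy labels)

sia = SentimentIntensityAnalyzer()
# Polarity/strength scores (neg, neu, pos, compound) become extra features.
sentiment = np.array([list(sia.polarity_scores(t).values()) for t in texts])

tfidf = TfidfVectorizer().fit_transform(texts).toarray()
features = np.hstack([tfidf, sentiment])  # content features + sentiment signal

clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))
```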

    A Call for Standardization and Validation of Text Style Transfer Evaluation

    Text Style Transfer (TST) evaluation is, in practice, inconsistent. Therefore, we conduct a meta-analysis of human and automated TST evaluation and experimentation that thoroughly examines the existing literature in the field. The meta-analysis reveals a substantial standardization gap in human and automated evaluation. In addition, we find a validation gap: only a few automated metrics have been validated using human experiments. We therefore thoroughly scrutinize both the standardization and the validation gap and reveal the resulting pitfalls. This work also paves the way to closing both gaps in TST evaluation by calling out requirements to be met by future research.
    Comment: Accepted to Findings of ACL 2023
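
Closing the validation gap ultimately boils down to checking how well an automated metric tracks human judgments. A minimal sketch, with made-up placeholder scores standing in for per-output metric values and human ratings from an annotation study:

```python
# Sketch: validating an automated TST metric against human judgments via
# rank correlation. All numbers below are FABRICATED PLACEHOLDERS.
from scipy.stats import spearmanr

metric_scores = [0.91, 0.34, 0.75, 0.52, 0.88, 0.20]  # automated metric, per output
human_ratings = [5, 2, 4, 3, 5, 1]                    # human ratings, same outputs

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A metric is only a trustworthy proxy for human evaluation if such
# correlations hold up consistently across datasets and style attributes.
```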

    Datasets for Large Language Models: A Comprehensive Survey

    This paper embarks on an exploration of Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as foundational infrastructure, analogous to a root system that sustains and nurtures the development of LLMs. Consequently, examination of these datasets emerges as a critical topic in research. To address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insights into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. Additionally, a comprehensive review of the existing available dataset resources is provided, including statistics from 444 datasets, covering 8 language categories and spanning 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at: https://github.com/lmmlzn/Awesome-LLMs-Datasets
    Comment: 181 pages, 21 figures
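
As an illustration of what a per-dataset record in such a catalogue might look like, here is a hypothetical schema (not the survey's actual data model) with the five perspectives as a category field and a few of the catalogued dimensions as attributes:

```python
# Illustrative schema for a dataset catalogue entry. The category values
# mirror the survey's five perspectives; the example record is INVENTED.
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    PRETRAINING_CORPUS = "pre-training corpus"
    INSTRUCTION_FINETUNING = "instruction fine-tuning"
    PREFERENCE = "preference"
    EVALUATION = "evaluation"
    TRADITIONAL_NLP = "traditional NLP"

@dataclass
class DatasetRecord:
    name: str
    category: Category
    languages: list[str]
    domain: str
    size: str      # e.g. token count, instance count, or bytes on disk
    license: str

record = DatasetRecord(
    name="ExampleCorpus",  # hypothetical entry
    category=Category.PRETRAINING_CORPUS,
    languages=["en", "zh"],
    domain="web",
    size="1.2 TB",
    license="CC BY-SA 4.0",
)
print(record)
```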

    Multilingual representations and models for improved low-resource language processing

    Word representations are the cornerstone of modern NLP. Representing words or characters using real-valued vectors as static representations that can capture semantics and encode meaning has been popular among researchers. In more recent years, pretrained language models using large amounts of data to create contextualized representations achieved great performance in various tasks such as semantic role labeling. These large pretrained language models are capable of storing and generalizing information and can be used as knowledge bases. Language models can produce multilingual representations while only using monolingual data during training. These multilingual representations can be beneficial in many tasks such as machine translation. Further, knowledge extraction models that previously relied only on information extracted from English resources can now benefit from extra resources in other languages. Although these results were achieved for high-resource languages, there are thousands of languages that do not have large corpora. Moreover, for other tasks such as machine translation, models need parallel data, which is scarce for most languages. Further, many languages lack tokenization models, and splitting text into meaningful segments such as words is not trivial. Although using subwords helps models achieve better coverage of unseen data and new vocabulary, generalizing over low-resource languages with different alphabets and grammars is still a challenge. This thesis investigates methods to overcome these issues for low-resource languages. In the first publication, we explore the degree of multilinguality in multilingual pretrained language models. We demonstrate that these language models can produce high-quality word alignments without using parallel training data, which is not available for many languages. In the second paper, we extract word alignments for all available language pairs in the Parallel Bible Corpus (PBC). Further, we create a tool for exploring these alignments, which is especially helpful for studying low-resource languages. The third paper investigates word alignment in multiparallel corpora and exploits graph algorithms for extracting new alignment edges. In the fourth publication, we propose a new model to iteratively generate cross-lingual word embeddings and extract word alignments when only small parallel corpora are available. Lastly, the fifth paper finds that aggregating different granularities of text can improve word alignment quality; we propose subword sampling to produce such granularities.
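
The alignment-without-parallel-data result can be sketched in a few lines: embed both sentences with a multilingual encoder, build the token-to-token similarity matrix, and keep mutual argmax pairs. This mirrors the general idea only; the model choice is an illustrative assumption, subword tokens are aligned directly, and the subword-to-word mapping is omitted for brevity.

```python
# Sketch: word alignment from a multilingual encoder without parallel data.
# Keep (i, j) pairs where source token i and target token j are each other's
# most similar token (mutual-argmax intersection heuristic).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[1:-1]  # drop [CLS] and [SEP]

src, tgt = "Das Haus ist klein.", "The house is small."
e_src, e_tgt = embed(src), embed(tgt)

# Cosine similarity between every source and target subword token.
sim = torch.nn.functional.normalize(e_src, dim=-1) @ \
      torch.nn.functional.normalize(e_tgt, dim=-1).T

forward = sim.argmax(dim=1)   # best target for each source token
backward = sim.argmax(dim=0)  # best source for each target token
alignments = [(i, j.item()) for i, j in enumerate(forward) if backward[j] == i]
print(alignments)  # list of aligned (source_idx, target_idx) subword pairs
```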