30 research outputs found

    Self-supervised learning in natural language processing

    Get PDF
Most natural language processing (NLP) learning algorithms require labeled data. While such data is available for a select number of (mostly English) tasks, labeled data is sparse or non-existent for the vast majority of use-cases. To alleviate this, unsupervised learning and a wide array of data augmentation techniques have been developed (Hedderich et al., 2021a). However, unsupervised learning often requires massive amounts of unlabeled data and also fails to perform in difficult (low-resource) data settings, i.e., if there is an increased distance between the source and target data distributions (Kim et al., 2020). Such a distributional distance can arise from a domain drift or a large linguistic distance between the source and target data. Unsupervised learning in itself does not exploit the highly informative (labeled) supervisory signals hidden in unlabeled data. In this dissertation, we show that by combining the right unsupervised auxiliary task (e.g., sentence pair extraction) with an appropriate primary task (e.g., machine translation), self-supervised learning can exploit these hidden supervisory signals more efficiently than purely unsupervised approaches, while requiring less labeled data than supervised approaches. Our self-supervised learning approach can be used to learn NLP tasks efficiently, even when training data is sparse or comes with strong differences in its underlying distribution, e.g., stemming from unrelated languages. As our general approach, we applied unsupervised learning as an auxiliary task to learn a supervised primary task. Concretely, we focused on the auxiliary task of sentence pair extraction for sequence-to-sequence primary tasks (i.e., machine translation and style transfer), as well as language modeling, clustering, subspace learning and knowledge integration for primary classification tasks (i.e., hate speech detection and sentiment analysis).
For sequence-to-sequence tasks, we show that self-supervised neural machine translation (NMT) achieves competitive results on high-resource language pairs in comparison to unsupervised NMT while requiring less data. Further combining self-supervised NMT with augmentation techniques inspired by unsupervised NMT makes the learning of low-resource (similar, distant and unrelated) language pairs possible. Furthermore, using our self-supervised approach, we show how style transfer can be learned without the need for parallel data, generating stylistic rephrasings with the highest overall performance on all tested tasks. For sequence-to-label tasks, we underline the benefit of auxiliary task-based augmentation over primary task augmentation. An auxiliary task that proved especially beneficial to primary task performance was subspace learning, which led to impressive gains in (cross-lingual) zero-shot classification performance on similar or distant target tasks, also on similar, distant and unrelated languages.
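The auxiliary task of sentence pair extraction mentioned in this abstract can be illustrated as nearest-neighbour mining over sentence embeddings with a margin criterion. The sketch below is a minimal, generic version of such margin-based mining; the embeddings, the neighbourhood size and the threshold are illustrative stand-ins, not the dissertation's actual models or hyperparameters.

```python
import numpy as np

def cosine_matrix(src, tgt):
    # Pairwise cosine similarities between two sets of sentence embeddings.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T

def extract_pairs(src_emb, tgt_emb, k=2, threshold=1.0):
    """Margin-based scoring: a pair qualifies if its cosine similarity
    clearly exceeds the average similarity of its k nearest neighbours
    and the match is mutual."""
    sim = cosine_matrix(src_emb, tgt_emb)
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per-source neighbourhood
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per-target neighbourhood
    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))
        margin = sim[i, j] / ((knn_src[i] + knn_tgt[j]) / 2)
        if margin >= threshold and i == int(np.argmax(sim[:, j])):
            pairs.append((i, j))  # mutual best match above the margin
    return pairs
```

In a self-supervised NMT loop, pairs mined this way would feed the primary translation task, which in turn improves the embeddings used for mining.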

    HUMAN: Hierarchical Universal Modular ANnotator

    Full text link
A lot of real-world phenomena are complex and cannot be captured by single-task annotations. This creates a need for subsequent annotations, with interdependent questions and answers describing the nature of the subject at hand. Even when a phenomenon is easily captured by a single task, the high specialisation of most annotation tools can mean having to switch to another tool if the task changes only slightly. We introduce HUMAN, a novel web-based annotation tool that addresses the above problems by a) covering a variety of annotation tasks on both textual and image data, and b) using an internal deterministic state machine, allowing the researcher to chain different annotation tasks in an interdependent manner. Further, the modular nature of the tool makes it easy to define new annotation tasks and integrate machine learning algorithms, e.g., for active learning. HUMAN comes with an easy-to-use graphical user interface that simplifies annotation and task management. Comment: 7 pages, 4 figures, EMNLP - Demonstrations 2020
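The chaining of interdependent annotation questions via a deterministic state machine can be sketched as follows. The states, answers and the two-step hate speech flow below are invented for illustration and do not mirror HUMAN's actual implementation.

```python
class AnnotationStateMachine:
    """Minimal deterministic state machine: each state is an annotation
    question, and the chosen answer determines the next question."""

    def __init__(self, transitions, start, end="DONE"):
        self.transitions = transitions  # {state: {answer: next_state}}
        self.state = start
        self.end = end

    def answer(self, label):
        if self.state == self.end:
            raise RuntimeError("annotation already finished")
        try:
            self.state = self.transitions[self.state][label]
        except KeyError:
            raise ValueError(f"answer {label!r} not allowed in state {self.state!r}")
        return self.state

# Hypothetical two-step annotation flow: only offensive texts
# trigger the follow-up question about the target.
flow = {
    "is_offensive": {"yes": "target_group", "no": "DONE"},
    "target_group": {"individual": "DONE", "group": "DONE"},
}
```

Because the transition table is data, adding a new annotation task amounts to extending the dictionary rather than changing tool code.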

    Exploring Conditional Language Model Based Data Augmentation Approaches For Hate Speech Classification

    Get PDF
Deep Neural Network (DNN) based classifiers have gained increased attention in hate speech classification. However, the performance of DNN classifiers increases with the quantity of available training data, and in reality, hate speech datasets contain only a small amount of labeled data. To counter this, Data Augmentation (DA) techniques are often used to increase the number of labeled samples and thereby improve the classifier's performance. In this article, we explore augmentation of training samples using a conditional language model. Our approach uses a single class-conditioned Generative Pre-Trained Transformer-2 (GPT-2) language model for DA, avoiding the need for multiple class-specific GPT-2 models. We study the effect of increasing the quantity of the augmented data and show that adding a few hundred samples significantly improves the classifier's performance. Furthermore, we evaluate the effect of filtering the generated data used for DA. Our approach demonstrates up to 7.3% and up to 25.0% relative improvements in macro-averaged F1 on two widely used hate speech corpora.
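The single-model class conditioning and the filtering step described in the abstract can be sketched as follows. The control tokens, the `generate` callable (which would wrap a GPT-2 model in practice) and the confidence threshold are illustrative assumptions, not the paper's actual setup.

```python
CLASS_TOKENS = {"hate": "<|hate|>", "noHate": "<|noHate|>"}  # hypothetical control tokens

def make_prompt(label):
    # One shared language model can be conditioned on the class by
    # prepending a class-specific control token to the prompt.
    return CLASS_TOKENS[label]

def augment(label, generate, n=5, classifier=None, min_conf=0.8):
    """Generate n candidate samples for `label`; optionally keep only
    those a separately trained classifier assigns to `label` with
    confidence of at least `min_conf`."""
    samples = [generate(make_prompt(label)) for _ in range(n)]
    if classifier is None:
        return samples
    kept = []
    for s in samples:
        pred, conf = classifier(s)
        if pred == label and conf >= min_conf:
            kept.append(s)
    return kept
```

The filtered samples would then be appended to the labeled training set before retraining the hate speech classifier.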

    Label Propagation-Based Semi-Supervised Learning for Hate Speech Classification

    Get PDF
Research on hate speech classification has received increased attention. In real-life scenarios, only a small amount of labeled hate speech data is available to train a reliable classifier. Semi-supervised learning takes advantage of a small amount of labeled data and a large amount of unlabeled data. In this paper, label propagation-based semi-supervised learning is explored for the task of hate speech classification. The quality of labeling the unlabeled set depends on the input representations. In this work, we show that pre-trained representations are label-agnostic and, when used with label propagation, yield poor results. Neural network-based fine-tuning can be adopted to learn task-specific representations using a small amount of labeled data. We show that fully fine-tuned representations may not always be the best representations for label propagation, and intermediate representations may perform better in a semi-supervised setup.
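The label propagation scheme explored in the paper can be illustrated with the classic iterative formulation over a similarity graph. The sketch below is a generic version with plain feature vectors standing in for the pre-trained or fine-tuned representations; graph construction, kernel width and iteration count are illustrative choices, not the paper's.

```python
import numpy as np

def label_propagation(X, y, n_classes, sigma=1.0, iters=50):
    """Propagate labels from labeled points (y >= 0) to unlabeled
    points (y == -1) over an RBF similarity graph, clamping the
    known labels after every step."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))           # dense affinity graph
    T = W / W.sum(axis=1, keepdims=True)          # row-normalised transition matrix
    Y = np.zeros((len(X), n_classes))
    labeled = y >= 0
    Y[labeled, y[labeled]] = 1.0
    for _ in range(iters):
        Y = T @ Y                                 # diffuse label mass
        Y[labeled] = 0.0
        Y[labeled, y[labeled]] = 1.0              # clamp known labels
    return Y.argmax(axis=1)
```

The paper's observation then amounts to saying that the quality of `X` (which layer of the fine-tuned network it comes from) largely determines how well this diffusion labels the unlabeled points.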

    Challenges in assessing and managing multi-hazard risks: a European stakeholders perspective

    Get PDF
The latest evidence suggests that multi-hazards and their interrelationships (e.g., triggering, compound, and consecutive hazards) are becoming more frequent across Europe, underlining a need for resilience building by moving from single-hazard-focused to multi-hazard risk assessment and management. Although significant advancements have been made in our understanding of these events, mainstream practice is still focused on risks due to single hazards (e.g., flooding, earthquakes, droughts), with a limited understanding of the stakeholder needs on the ground. To overcome this limitation, this paper sets out to understand the challenges of moving towards multi-hazard risk management through the perspective of European stakeholders. Based on five workshops across different European pilots (Danube Region, Veneto Region, Scandinavia, North Sea, and Canary Islands) and an expert workshop, we identify five prime challenges: i) governance, ii) knowledge of multi-hazards and multi-risks, iii) existing approaches to disaster risk management, iv) translation of science to policy and practice, and v) lack of data. These challenges are inherently linked and cannot be tackled in isolation, with path dependency posing a significant hurdle in transitioning from single- to multi-hazard risk management. Going forward, we identify promising approaches for overcoming some of the challenges, including emerging approaches for multi-hazard characterisation, a common understanding of terminology, and a comprehensive framework for guiding multi-hazard risk assessment and management. We argue for a need to think beyond natural hazards and include other threats in creating a comprehensive overview of multi-hazard risks, as well as promoting thinking of multi-hazard risk reduction in the context of larger development goals.

    D1.2 Handbook of multi-hazard, multi-risk definitions and concepts

    Get PDF
This report is the first output of Work Package 1: Diagnosis of the MYRIAD-EU project: Handbook of Multi-hazard, Multi-Risk Definitions and Concepts. The aim of the task was to (i) acknowledge the differences and promote consistency in understanding across subsequent work packages in the MYRIAD-EU project, (ii) improve the accessibility of our work to a broad array of stakeholders and (iii) strengthen consensus across the hazard and risk community through a common understanding of multi-hazard, multi-risk terminology and concepts. The work encompassed a mixed-methods approach, including internal consultations and data-generating exercises; literature reviews; external stakeholder engagement; and adopting and building on a rich existing body of established glossaries. 140 terms are included in the glossary: 102 related to multi-hazard, multi-risk and disaster risk management, and an additional 38 included due to their relevance to the project, acknowledging the need for a common understanding amongst an interdisciplinary project consortium. We also include extended definitions related to concepts particularly of relevance to this project deliverable, including ‘multi-hazard’, ‘hazard interrelationships’, ‘multi-risk’ and ‘direct and indirect loss and risk’. Underpinned by a literature review and internal consultation, we include a specific section on indicators, how these might be applied within a multi-hazard and multi-risk context, and how existing indicators could be adapted to consider multi-risk management. We emphasise that there are a number of established glossaries that the project (and risk community) should make use of to strengthen the impact of the work we do, noting in our literature review a tendency in papers and reports to define words afresh.
We conclude the report with a selection of key observations, including that terminology matters for all aspects of disaster risk management, for example communication, data collection, measuring progress and reporting against Sendai Framework targets. At the same time, we discuss when it is helpful to include ‘multi-’ as a prefix, questioning whether part of the paradigm shift needed to successfully address complex challenges facing an interconnected world lies in inherently seeing vulnerability, exposure and disaster risk through the lens of multiple, interrelated hazards. We emphasise that the terminology is likely to evolve throughout the project lifetime as terms emerge or shift. Finally, we propose a roadmap for developing and testing draft multi-risk indicators in MYRIAD-EU. The WP1 team would like to acknowledge all the contributions of the consortium on this task and the feedback from the External Advisory Board, in particular the chair of the board Virginia Murray, Head of Global Disaster Risk Reduction at the UK Health Security Agency, and the contribution of Jenty Kirsch-Wood, Head of Global Risk Management and Reporting at UNDRR, for her reflections on the findings of this work.

    StereoKG: Data-Driven Knowledge Graph Construction for Cultural Knowledge and Stereotypes

    Full text link
Analyzing ethnic or religious bias is important for improving the fairness, accountability, and transparency of natural language processing models. However, many techniques rely on human-compiled lists of bias terms, which are expensive to create and limited in coverage. In this study, we present a fully data-driven pipeline for generating a knowledge graph (KG) of cultural knowledge and stereotypes. Our resulting KG covers 5 religious groups and 5 nationalities and can easily be extended to include more entities. Our human evaluation shows that the majority (59.2%) of non-singleton entries are coherent and complete stereotypes. We further show that performing intermediate masked language model training on the verbalized KG leads to a higher level of cultural awareness in the model and has the potential to increase classification performance on knowledge-crucial samples on a related task, i.e., hate speech detection. Comment: 12 pages, 2 figures, accepted as a long paper at WOAH at NAACL 202
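The verbalization step, which turns KG triples into sentences that can serve as input for intermediate masked-LM training, can be sketched as simple template filling. The relation names and templates below are hypothetical examples, not StereoKG's actual schema.

```python
# Hypothetical templates for turning (subject, relation, object) triples
# into natural-language sentences for intermediate masked-LM training.
TEMPLATES = {
    "has_stereotype": "{subj} are often stereotyped as {obj}.",
    "associated_with": "{subj} are commonly associated with {obj}.",
}

def verbalize(triples):
    """Turn KG triples into sentences; unknown relations fall back to a
    generic subject-relation-object pattern."""
    sentences = []
    for subj, rel, obj in triples:
        template = TEMPLATES.get(rel, "{subj} {rel} {obj}.")
        sentences.append(template.format(subj=subj, rel=rel.replace("_", " "), obj=obj))
    return sentences
```

The resulting sentences would then be randomly masked and used for a round of masked language model training before fine-tuning on the downstream hate speech task.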
