238 research outputs found

    Learning to Identify Ambiguous and Misleading News Headlines

    Full text link
    Accuracy is one of the basic principles of journalism. However, it is increasingly hard to manage due to the diversity of news media. Some editors of online news tend to use catchy headlines which trick readers into clicking. These headlines are either ambiguous or misleading, degrading the reading experience of the audience. Thus, identifying inaccurate news headlines is a task worth studying. Previous work names these headlines "clickbaits" and mainly focus on the features extracted from the headlines, which limits the performance since the consistency between headlines and news bodies is underappreciated. In this paper, we clearly redefine the problem and identify ambiguous and misleading headlines separately. We utilize class sequential rules to exploit structure information when detecting ambiguous headlines. For the identification of misleading headlines, we extract features based on the congruence between headlines and bodies. To make use of the large unlabeled data set, we apply a co-training method and gain an increase in performance. The experiment results show the effectiveness of our methods. Then we use our classifiers to detect inaccurate headlines crawled from different sources and conduct a data analysis.Comment: Accepted by IJCAI 201

    Distributional Inclusion Vector Embedding for Unsupervised Hypernymy Detection

    Full text link
    Modeling hypernymy, such as poodle is-a dog, is an important generalization aid to many NLP tasks, such as entailment, coreference, relation extraction, and question answering. Supervised learning from labeled hypernym sources, such as WordNet, limits the coverage of these models, which can be addressed by learning hypernyms from unlabeled text. Existing unsupervised methods either do not scale to large vocabularies or yield unacceptably poor accuracy. This paper introduces distributional inclusion vector embedding (DIVE), a simple-to-implement unsupervised method of hypernym discovery via per-word non-negative vector embeddings which preserve the inclusion property of word contexts in a low-dimensional and interpretable space. In experimental evaluations more comprehensive than any previous literature of which we are aware-evaluating on 11 datasets using multiple existing as well as newly proposed scoring functions-we find that our method provides up to double the precision of previous unsupervised embeddings, and the highest average performance, using a much more compact word representation, and yielding many new state-of-the-art results.Comment: NAACL 201

    Sociocultural Norm Similarities and Differences via Situational Alignment and Explainable Textual Entailment

    Full text link
    Designing systems that can reason across cultures requires that they are grounded in the norms of the contexts in which they operate. However, current research on developing computational models of social norms has primarily focused on American society. Here, we propose a novel approach to discover and compare descriptive social norms across Chinese and American cultures. We demonstrate our approach by leveraging discussions on a Chinese Q&A platform (Zhihu) and the existing SocialChemistry dataset as proxies for contrasting cultural axes, align social situations cross-culturally, and extract social norms from texts using in-context learning. Embedding Chain-of-Thought prompting in a human-AI collaborative framework, we build a high-quality dataset of 3,069 social norms aligned with social situations across Chinese and American cultures alongside corresponding free-text explanations. To test the ability of models to reason about social norms across cultures, we introduce the task of explainable social norm entailment, showing that existing models under 3B parameters have significant room for improvement in both automatic and human evaluation. Further analysis of cross-cultural norm differences based on our dataset shows empirical alignment with the social orientations framework, revealing several situational and descriptive nuances in norms across these cultures.Comment: EMNLP 2023 Main Conference (Long Paper

    Cross-Lingual Textual Entailment and Applications

    Get PDF
    Textual Entailment (TE) has been proposed as a generic framework for modeling language variability. The great potential of integrating (monolingual) TE recognition components into NLP architectures has been reported in several areas, such as question answering, information retrieval, information extraction and document summarization. Mainly due to the absence of cross-lingual TE (CLTE) recognition components, similar improvements have not yet been achieved in any corresponding cross-lingual application. In this thesis, we propose and investigate Cross-Lingual Textual Entailment (CLTE) as a semantic relation between two text portions in dierent languages. We present dierent practical solutions to approach this problem by i) bringing CLTE back to the monolingual scenario, translating the two texts into the same language; and ii) integrating machine translation and TE algorithms and techniques. We argue that CLTE can be a core tech- nology for several cross-lingual NLP applications and tasks. Experiments on dierent datasets and two interesting cross-lingual NLP applications, namely content synchronization and machine translation evaluation, conrm the eectiveness of our approaches leading to successful results. As a complement to the research in the algorithmic side, we successfully explored the creation of cross-lingual textual entailment corpora by means of crowdsourcing, as a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    Get PDF
    Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to translation and reordering models; each of them corresponds to a morphological variation of the source/target/both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations and results showed improvements of the performance in terms of automatic scores (BLEU and Meteor) and reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.JRC.G.2-Global security and crisis managemen

    IndoNLI : a Natural Language Inference Dataset for Indonesian

    Get PDF
    We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect ~18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerical reasoning, structural changes, idioms, or temporal and spatial reasoning. Experiment results show that XLM-R outperforms other pre-trained models in our data. The best performance on the expert-annotated data is still far below human performance (13.4% accuracy gap), suggesting that this test set is especially challenging. Furthermore, our analysis shows that our expert-annotated data is more diverse and contains fewer annotation artifacts than the crowd-annotated data. We hope this dataset can help accelerate progress in Indonesian NLP research

    Identifying Semantic Divergences Across Languages

    Get PDF
    Cross-lingual resources such as parallel corpora and bilingual dictionaries are cornerstones of multilingual natural language processing (NLP). They have been used to study the nature of translation, train automatic machine translation systems, as well as to transfer models across languages for an array of NLP tasks. However, the majority of work in cross-lingual and multilingual NLP assumes that translations recorded in these resources are semantically equivalent. This is often not the case---words and sentences that are considered to be translations of each other frequently divergein meaning, often in systematic ways. In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches as cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning preserving which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision. We support this claim through three main contributions. First, we show that a large fraction of data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks. We first show that our model for identifying semantic relations between words helps in separating equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data helps in training a neural machine translation system twice as fast without sacrificing quality

    AugCSE: contrastive sentence embedding with diverse augmentations

    Get PDF
    Data augmentation techniques have been proven useful in many applications in NLP fields. Most augmentations are task-specific, and cannot be used as a general-purpose tool. In our work, we present AugCSE, a unified framework to utilize diverse sets of data augmentations to achieve a better, general-purpose, sentence embedding model. Building upon the latest sentence embedding models, our approach uses a simple antagonistic discriminator that differentiates the augmentation types. With the finetuning objective borrowed from domain adaptation, we show that diverse augmentations, which often lead to conflicting contrastive signals, can be tamed to produce a better and more robust sentence representation. Our methods achieve state-of-the-art results on downstream transfer tasks and perform competitively on semantic textual similarity tasks, using only unsupervised data.000000000000000000000000000000000000000000000000000000010241 - University of California, Berkeleyhttps://aclanthology.org/2022.aacl-main.30/First author draf

    From Discourse Structure To Text Specificity: Studies Of Coherence Preferences

    Get PDF
    To successfully communicate through text, a writer needs to organize information into an understandable and well-structured discourse for the targeted audience. This involves deciding when to convey general statements, when to elaborate on details, and gauging how much details to convey, i.e., the level of specificity. This thesis explores the automatic prediction of text specificity, and whether the perception of specificity varies across different audiences. We characterize text specificity from two aspects: the instantiation discourse relation, and the specificity of sentences and words. We identify characteristics of instantiation that signify a change of specificity between sentences. Features derived from these characteristics substantially improve the detection of the relation. Using instantiation sentences as the basis for training, we propose a semi-supervised system to predict sentence specificity with speed and accuracy. Furthermore, we present insights into the effect of underspecified words and phrases on the comprehension of text, and the prediction of such words. We show distinct preferences in specificity and discourse structure among different audiences. We investigate these distinctions in both cross-lingual and monolingual context. Cross-lingually, we identify discourse factors that significantly impact the quality of text translated from Chinese to English. Notably, a large portion of Chinese sentences are significantly more specific and need to be translated into multiple English sentences. We introduce a system using rich syntactic features to accurately detect such sentences. We also show that simplified text is more general, and that specific sentences are more likely to need simplification. Finally, we present evidence that the perception of sentence specificity differs among male and female readers

    Exploiting Cross-Lingual Representations For Natural Language Processing

    Get PDF
    Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is di cult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive to annotate resources which are not available for most languages. In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals and study algorithmic approaches of using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combining supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest
    corecore