
    Detecting (Un)Important Content for Single-Document News Summarization

    We present a robust approach for detecting intrinsic sentence importance in news, by training on two corpora of document-summary pairs. When used for single-document summarization, our approach, combined with the "beginning of document" heuristic, outperforms a state-of-the-art summarizer and the beginning-of-article baseline in both automatic and manual evaluations. These results represent an important advance because in the absence of cross-document repetition, single-document summarizers for news have not been able to consistently outperform the strong beginning-of-article baseline.
    Comment: Accepted by EACL 201
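The "beginning of document" heuristic the abstract combines with its learned importance model can be sketched as a lead-sentence baseline. This is a minimal illustration, not the authors' system; the naive `". "` sentence split is an assumption for brevity.

```python
def lead_baseline(document: str, k: int = 3) -> str:
    """Lead baseline: return the first k sentences of a news article
    as its summary. Sentence splitting here is a naive '. ' split,
    used only to keep the sketch self-contained."""
    sentences = [s.strip() for s in document.split(". ") if s.strip()]
    summary = ". ".join(sentences[:k])
    # Restore a trailing period lost by the split, if anything was kept.
    if summary and not summary.endswith("."):
        summary += "."
    return summary
```

The paper's point is that this trivially simple baseline is hard to beat for news, because journalists front-load the most important content; the learned importance scores are used to decide when to deviate from it.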

    Distantly Labeling Data for Large Scale Cross-Document Coreference

    Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on "distantly labeling" a data set from which we can train a discriminative cross-document coreference model. In particular, we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of such labeling); then we train and evaluate a conditional random field coreference model that has factors on cross-document entities as well as mention pairs. This coreference model obtains high accuracy in resolving mentions and entities that are not present in the training data, indicating applicability to non-Wikipedia data. Given the large amount of data, our work is also an exercise demonstrating the scalability of our approach.
    Comment: 16 pages, submitted to ECML 201
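The distant-labeling idea can be sketched as follows: two mentions get a positive coreference label when they resolve to the same Wikipedia title. This is a simplification with assumed field names (`text`, `wiki`); the paper uses a generative linking model rather than the exact-title lookup shown here.

```python
from itertools import combinations

def distantly_label(mentions):
    """Weakly supervise coreference: a mention pair is labeled
    coreferent iff both mentions were linked to the same (non-null)
    Wikipedia title. Each mention is a dict with hypothetical keys
    'text' (surface string) and 'wiki' (linked title or None)."""
    pairs = []
    for a, b in combinations(mentions, 2):
        label = a["wiki"] is not None and a["wiki"] == b["wiki"]
        pairs.append((a["text"], b["text"], label))
    return pairs
```

Pairs labeled this way then serve as training data for a discriminative model, which can generalize to mentions whose referents never appear in Wikipedia.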

    Dataset for Automated Fact Checking in Czech Language

    Our work examines the existing datasets for the task of automated fact-verification of textual claims and proposes two methods of their acquisition in the low-resource Czech language. It first delivers a large-scale FEVER CS dataset of 127K annotated claims by applying Machine Translation methods to a dataset available in English. It then designs a set of human-annotation experiments for collecting a novel dataset in Czech, using the ČTK Archive corpus as a knowledge base, and conducts them with a group of 163 students of FSS CUNI, yielding a dataset of 3,295 cross-annotated claims with a 4-way Fleiss' Kappa agreement of 0.63. It then proceeds to show the eligibility of the dataset for training Czech Natural Language Inference models, training an XLM-RoBERTa model scoring 85.5% micro-F1 in the task of classifying claim veracity given textual evidence.
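The 4-way Fleiss' Kappa of 0.63 reported above measures chance-corrected agreement among multiple annotators over four categories. A minimal stdlib implementation of the standard Fleiss' kappa formula, independent of this paper's data, looks like this:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for inter-annotator agreement.
    counts[i][j] = number of annotators who assigned item i to
    category j; every row must sum to the same rater count r."""
    n = len(counts)              # number of items
    r = sum(counts[0])           # raters per item
    c = len(counts[0])           # number of categories
    # Marginal proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / (n * r) for j in range(c)]
    # Expected agreement by chance.
    p_e = sum(p * p for p in p_j)
    # Mean observed per-item agreement.
    p_bar = sum((sum(x * x for x in row) - r) / (r * (r - 1))
                for row in counts) / n
    return (p_bar - p_e) / (1 - p_e)
```

A kappa of 0.63 is conventionally read as substantial agreement, which supports using the cross-annotated claims for supervised training.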

    REDAffectiveLM: Leveraging Affect Enriched Embedding and Transformer-based Neural Language Model for Readers' Emotion Detection

    Technological advancements in web platforms allow people to express and share emotions towards textual write-ups written and shared by others. This brings about two interesting domains for analysis: emotion expressed by the writer and emotion elicited from the readers. In this paper, we propose a novel approach for Readers' Emotion Detection from short-text documents using a deep learning model called REDAffectiveLM. Within state-of-the-art NLP tasks, it is well understood that utilizing context-specific representations from transformer-based pre-trained language models helps achieve improved performance. Within this affective computing task, we explore how incorporating affective information can further enhance performance. Towards this, we leverage context-specific and affect-enriched representations by using a transformer-based pre-trained language model in tandem with an affect-enriched Bi-LSTM+Attention network. For empirical evaluation, we procure a new dataset REN-20k, besides using RENh-4k and SemEval-2007. We evaluate the performance of our REDAffectiveLM rigorously across these datasets, against a vast set of state-of-the-art baselines, where our model consistently outperforms the baselines and obtains statistically significant results. Our results establish that utilizing affect-enriched representation along with context-specific representation within a neural architecture can considerably enhance readers' emotion detection. Since the impact of affect enrichment specifically in readers' emotion detection isn't well explored, we conduct a detailed analysis over affect-enriched Bi-LSTM+Attention using qualitative and quantitative model behavior evaluation techniques. We observe that, compared to conventional semantic embedding, affect-enriched embedding increases the ability of the network to effectively identify and assign weightage to key terms responsible for readers' emotion detection.
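The attention mechanism the abstract credits with assigning weightage to key terms reduces to a softmax over per-token relevance scores. The sketch below shows only that core step, under the assumption that affect enrichment raises the scores of affect-bearing tokens; it is not the REDAffectiveLM architecture.

```python
import math

def attention_weights(scores):
    """Numerically stable softmax over per-token relevance scores.
    Tokens with higher scores (e.g. affect-bearing terms under an
    affect-enriched embedding) receive proportionally more weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

In a Bi-LSTM+Attention model these weights multiply the per-token hidden states before pooling, so raising a key term's score directly increases its contribution to the document representation used for emotion classification.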