Detecting (Un)Important Content for Single-Document News Summarization
We present a robust approach for detecting intrinsic sentence importance in
news, by training on two corpora of document-summary pairs. When used for
single-document summarization, our approach, combined with the "beginning of
document" heuristic, outperforms a state-of-the-art summarizer and the
beginning-of-article baseline in both automatic and manual evaluations. These
results represent an important advance because in the absence of cross-document
repetition, single document summarizers for news have not been able to
consistently outperform the strong beginning-of-article baseline.
Comment: Accepted by EACL 201
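The abstract above combines an intrinsic sentence-importance model with a "beginning of document" heuristic. A minimal sketch of one such combination, where the scoring function, mixing weight, and decay rate are all illustrative assumptions rather than the paper's actual method:

```python
# Hypothetical sketch: blend an intrinsic importance score with a
# beginning-of-document position prior. All names and weights are assumed.

def position_prior(index, decay=0.5):
    """Earlier sentences receive a higher prior (position heuristic)."""
    return decay ** index

def combined_score(intrinsic, index, alpha=0.7):
    """Weighted blend of the model's intrinsic score and the position prior."""
    return alpha * intrinsic + (1 - alpha) * position_prior(index)

def summarize(sentences, intrinsic_scores, k=2):
    """Select the top-k sentences by combined score, kept in document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: combined_score(intrinsic_scores[i], i),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]
```

With `alpha` near 1 the intrinsic model dominates; with `alpha` near 0 the summarizer degenerates to the beginning-of-article baseline the paper compares against.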
Distantly Labeling Data for Large Scale Cross-Document Coreference
Cross-document coreference, the problem of resolving entity mentions across
multi-document collections, is crucial to automated knowledge base construction
and data mining tasks. However, the scarcity of large labeled data sets has
hindered supervised machine learning research for this task. In this paper we
develop and demonstrate an approach based on "distantly-labeling" a data set
from which we can train a discriminative cross-document coreference model. In
particular, we build a dataset of more than a million person mentions extracted
from 3.5 years of New York Times articles, leverage Wikipedia for distant
labeling with a generative model (and measure the reliability of such
labeling); then we train and evaluate a conditional random field coreference
model that has factors on cross-document entities as well as mention-pairs.
This coreference model obtains high accuracy in resolving mentions and entities
that are not present in the training data, indicating applicability to
non-Wikipedia data. Given the large amount of data, our work is also an
exercise demonstrating the scalability of our approach.
Comment: 16 pages, submitted to ECML 201
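The core idea of distant labeling, as described above, is that mentions linking to the same external entity can be treated as noisy coreferent pairs. A toy sketch, where the alias table stands in for the paper's generative Wikipedia-linking model and all entries are invented for illustration:

```python
# Hypothetical sketch of distant labeling for coreference: mentions whose
# surface forms resolve to the same (assumed) Wikipedia title get the same
# entity label, yielding noisy positive training pairs.

WIKI_ALIASES = {
    "barack obama": "Barack_Obama",
    "obama": "Barack_Obama",
    "hillary clinton": "Hillary_Clinton",
}

def distant_label(mentions):
    """Map each mention string to an entity label (None if unlinkable)."""
    return [WIKI_ALIASES.get(m.lower()) for m in mentions]

def coreferent_pairs(mentions):
    """All mention-index pairs that share a distant entity label."""
    labels = distant_label(mentions)
    return [
        (i, j)
        for i in range(len(mentions))
        for j in range(i + 1, len(mentions))
        if labels[i] is not None and labels[i] == labels[j]
    ]
```

The pairs produced this way are noisy (hence the paper's measurement of labeling reliability), but at the scale of millions of mentions they suffice to train a discriminative model.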
Dataset for Automated Fact Checking in Czech Language
Our work examines existing datasets for the task of automated fact verification of textual claims and proposes two methods of acquiring them in the low-resource Czech language. It first delivers the large-scale FEVER CS dataset of 127K annotated claims, produced by applying machine translation to a dataset available in English. It then designs a set of human-annotation experiments for collecting a novel dataset in Czech, using the ČTK Archive corpus as a knowledge base, and conducts them with a group of 163 students of FSS CUNI, yielding a dataset of 3,295 cross-annotated claims with a 4-way Fleiss' kappa agreement of 0.63. Finally, it demonstrates the dataset's suitability for training Czech natural language inference models by training an XLM-RoBERTa model that scores 85.5% micro-F1 on the task of classifying claim veracity given textual evidence.
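The 4-way Fleiss' kappa of 0.63 reported above is a standard chance-corrected agreement statistic for multiple annotators. A small self-contained sketch of its computation, using the textbook formula (not code from the paper):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an item-by-category count table:
    ratings[i][j] = number of annotators who put item i in category j.
    Every item must be rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Per-item observed agreement.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from marginal category proportions.
    n_cats = len(ratings[0])
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

A kappa of 1.0 means perfect agreement, 0 means chance-level agreement; values around 0.6, as in the dataset above, are conventionally read as substantial agreement.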
REDAffectiveLM: Leveraging Affect Enriched Embedding and Transformer-based Neural Language Model for Readers' Emotion Detection
Technological advancements in web platforms allow people to express and share
emotions towards textual write-ups written and shared by others. This opens up
two interesting domains of analysis: the emotion expressed by the writer and
the emotion elicited in the readers. In this paper, we propose a novel
approach for Readers' Emotion Detection from short-text documents using a deep
learning model called REDAffectiveLM. Within state-of-the-art NLP tasks, it is
well understood that utilizing context-specific representations from
transformer-based pre-trained language models helps achieve improved
performance. Within this affective computing task, we explore how incorporating
affective information can further enhance performance. Towards this, we
leverage context-specific and affect enriched representations by using a
transformer-based pre-trained language model in tandem with affect enriched
Bi-LSTM+Attention. For empirical evaluation, we procure a new dataset, REN-20k,
in addition to using RENh-4k and SemEval-2007. We evaluate the performance of our
REDAffectiveLM rigorously across these datasets, against a vast set of
state-of-the-art baselines, where our model consistently outperforms baselines
and obtains statistically significant results. Our results establish that
utilizing affect enriched representation along with context-specific
representation within a neural architecture can considerably enhance readers'
emotion detection. Since the impact of affect enrichment specifically in
readers' emotion detection isn't well explored, we conduct a detailed analysis
over affect enriched Bi-LSTM+Attention using qualitative and quantitative model
behavior evaluation techniques. We observe that, compared to conventional
semantic embeddings, affect-enriched embeddings increase the ability of the
network to identify and assign weight to the key terms responsible for
eliciting readers' emotions.
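The attention behavior described above, where affect-enriched features shift weight toward emotionally salient terms, can be illustrated with a toy scorer. This is not the paper's Bi-LSTM+Attention architecture: the scalar features, weights, and function names are all assumptions made to keep the sketch dependency-free.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(semantic, affect, w_sem=1.0, w_aff=1.0):
    """Score each token by a weighted sum of its semantic and affect features,
    then normalize into attention weights. Real models use learned vector
    representations; scalars here are purely illustrative."""
    scores = [w_sem * s + w_aff * a for s, a in zip(semantic, affect)]
    return softmax(scores)
```

With the affect channel present, a token carrying strong affect signal receives a larger attention weight even when its semantic score matches its neighbors', mirroring the qualitative finding reported above.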