Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
Automatically evaluating the quality of dialogue responses for unstructured
domains is a challenging problem. Unfortunately, existing automatic evaluation
metrics are biased and correlate very poorly with human judgements of response
quality. Yet having an accurate automatic evaluation procedure is crucial for
dialogue research, as it allows rapid prototyping and testing of new models
with fewer expensive human evaluations. In response to this challenge, we
formulate automatic dialogue evaluation as a learning problem. We present an
evaluation model (ADEM) that learns to predict human-like scores to input
responses, using a new dataset of human response scores. We show that the ADEM
model's predictions correlate significantly, and at a level much higher than
word-overlap metrics such as BLEU, with human judgements at both the utterance
and system-level. We also show that ADEM can generalize to evaluating dialogue
models unseen during training, an important step for automatic dialogue
evaluation.
Comment: ACL 2017
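The abstract's central claim is that ADEM's predicted scores correlate with human judgements far better than word-overlap metrics such as BLEU. A minimal sketch of how such a comparison is measured, using Pearson correlation over hypothetical per-utterance scores (the numbers below are illustrative, not from the paper):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical human ratings (1-5) and two automatic scores for five responses.
human = [4.0, 2.0, 5.0, 1.0, 3.0]
adem  = [3.8, 2.2, 4.6, 1.4, 3.1]        # learned metric, tracks humans
bleu  = [0.30, 0.25, 0.10, 0.28, 0.22]   # word overlap, weak signal

print(pearson(human, adem) > pearson(human, bleu))
```

The same computation applies at the system level by first averaging each system's scores, then correlating the per-system averages.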
The Effects of Twitter Sentiment on Stock Price Returns
Social media increasingly reflect and influence the behavior of other
complex systems. In this paper we investigate the relations between the
well-known micro-blogging platform Twitter and financial markets. In particular, we
consider, in a period of 15 months, the Twitter volume and sentiment about the
30 stock companies that form the Dow Jones Industrial Average (DJIA) index. We
find a relatively low Pearson correlation and Granger causality between the
corresponding time series over the entire time period. However, we find a
significant dependence between the Twitter sentiment and abnormal returns
during the peaks of Twitter volume. This is valid not only for the expected
Twitter volume peaks (e.g., quarterly announcements), but also for peaks
corresponding to less obvious events. We formalize the procedure by adapting
the well-known "event study" from economics and finance to the analysis of
Twitter data. The procedure allows us to automatically identify events as Twitter
volume peaks, to compute the prevailing sentiment (positive or negative)
expressed in tweets at these peaks, and finally to apply the "event study"
methodology to relate them to stock returns. We show that sentiment polarity of
Twitter peaks implies the direction of cumulative abnormal returns. The amount
of cumulative abnormal returns is relatively low (about 1-2%), but the
dependence is statistically significant for several days after the events.
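The adapted event-study procedure described above has three steps: detect Twitter volume peaks, read off the prevailing sentiment at each peak, and relate it to the cumulative abnormal return (CAR) that follows. A minimal sketch under assumed inputs (daily tweet volume, mean sentiment in [-1, 1], and precomputed abnormal returns; the peak threshold of mean + 1.5 standard deviations and the one-day window are illustrative choices, not the paper's parameters):

```python
# Hypothetical daily series for one DJIA stock: tweet volume, mean tweet
# sentiment, and abnormal return (actual minus model-expected return).
volume    = [100, 110, 95, 400, 105, 90, 380, 100]
sentiment = [0.1, 0.0, -0.1, 0.6, 0.1, 0.0, -0.7, 0.1]
abn_ret   = [0.001, -0.002, 0.000, 0.012, 0.004, -0.001, -0.015, -0.003]

def detect_peaks(vol, k=1.5):
    """Days where volume exceeds its mean by more than k standard deviations."""
    n = len(vol)
    mean = sum(vol) / n
    std = (sum((v - mean) ** 2 for v in vol) / n) ** 0.5
    return [t for t, v in enumerate(vol) if v > mean + k * std]

def event_study(vol, sent, ret, window=1):
    """For each volume peak, pair the peak-day sentiment polarity with the
    cumulative abnormal return over the peak day and `window` following days."""
    events = []
    for t in detect_peaks(vol):
        car = sum(ret[t:t + window + 1])
        polarity = 1 if sent[t] > 0 else -1
        events.append((t, polarity, car))
    return events
```

On this toy series the positive-sentiment peak (day 3) is followed by a positive CAR and the negative-sentiment peak (day 6) by a negative one, mirroring the paper's finding that peak sentiment polarity implies the direction of cumulative abnormal returns.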
Fine-grained human evaluation of neural versus phrase-based machine translation
We compare three approaches to statistical machine translation (pure
phrase-based, factored phrase-based and neural) by performing a fine-grained
manual evaluation via error annotation of the systems' outputs. The error types
in our annotation are compliant with the multidimensional quality metrics
(MQM), and the annotation is performed by two annotators. Inter-annotator
agreement is high for such a task, and results show that the best performing
system (neural) reduces the errors produced by the worst system (phrase-based)
by 54%.
Comment: 12 pages, 2 figures, The Prague Bulletin of Mathematical Linguistics
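Two quantities carry this abstract's argument: inter-annotator agreement on the MQM error labels, and the relative error reduction between the best and worst systems. A sketch of both computations on hypothetical annotations (Cohen's kappa is a standard agreement measure for two annotators; the paper may report a different statistic, and the labels below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance
    return (po - pe) / (1 - pe)

# Hypothetical MQM-style error labels from two annotators on ten segments.
ann1 = ["fluency", "accuracy", "accuracy", "none", "fluency",
        "none", "accuracy", "fluency", "none", "accuracy"]
ann2 = ["fluency", "accuracy", "fluency", "none", "fluency",
        "none", "accuracy", "fluency", "none", "accuracy"]

def error_reduction(worst_errors, best_errors):
    """Relative reduction in annotated error count, best vs. worst system."""
    return (worst_errors - best_errors) / worst_errors
```

With these toy labels the annotators agree on nine of ten segments, giving a kappa well above chance; `error_reduction(200, 92)` would correspond to the 54% figure quoted for the neural system.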
SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications
We describe the SemEval task of extracting keyphrases and relations between
them from scientific documents, which is crucial for understanding which
publications describe which processes, tasks and materials. Although this was a
new task, we had a total of 26 submissions across 3 evaluation scenarios. We
expect the task and the findings reported in this paper to be relevant for
researchers working on understanding scientific content, as well as the broader
knowledge base population and information extraction communities.