Exploring Prediction Uncertainty in Machine Translation Quality Estimation
Machine Translation Quality Estimation is a notoriously difficult task, which
lessens its usefulness in real-world translation environments. Such scenarios
can be improved if quality predictions are accompanied by a measure of
uncertainty. However, models in this task are traditionally evaluated only in
terms of point estimate metrics, which do not take prediction uncertainty into
account. We investigate probabilistic methods for Quality Estimation that can
provide well-calibrated uncertainty estimates and evaluate them in terms of
their full posterior predictive distributions. We also show how this posterior
information can be useful in an asymmetric risk scenario, which aims to capture
typical situations in translation workflows. Comment: Proceedings of CoNLL 201
Bridging the gap between folksonomies and the semantic web: an experience report
While folksonomies allow tagging of similar resources with a variety of tags, their content retrieval mechanisms are severely hampered by being agnostic to the relations that exist between these tags. To overcome this limitation, several methods have been proposed to find groups of implicitly inter-related tags. We believe that content retrieval can be further improved by making the relations between tags explicit. In this paper we propose the semantic enrichment of folksonomy tags with explicit relations by harvesting the Semantic Web, i.e., dynamically selecting and combining relevant bits of knowledge from online ontologies. Our experimental results show that, while semantic enrichment needs to be aware of the particular characteristics of folksonomies and the Semantic Web, it is beneficial for both.
Complex Word Identification: Challenges in Data Annotation and System Performance
This paper revisits the problem of complex word identification (CWI)
following up the SemEval CWI shared task. We use ensemble classifiers to
investigate how well computational methods can discriminate between complex and
non-complex words. Furthermore, we analyze the classification performance to
understand what makes lexical complexity challenging. Our findings show that
most systems performed poorly on the SemEval CWI dataset, and one of the
reasons for that is the way in which human annotation was performed. Comment: Proceedings of the 4th Workshop on NLP Techniques for Educational
Applications (NLPTEA 2017)
Multi-modal Context Modelling for Machine Translation
MultiMT is a European Research Council Starting Grant whose aim is to devise data, methods and algorithms to exploit multi-modal information (images, audio, metadata) for context modelling in machine translation and other cross-lingual tasks. The project draws upon different research fields including natural language processing, computer vision, speech processing and machine learning.
Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words
Conference paper: Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words
Revisiting Contextual Toxicity Detection in Conversations
Understanding toxicity in user conversations is undoubtedly an important
problem. Addressing "covert" or implicit cases of toxicity is particularly hard
and requires context. Very few previous studies have analysed the influence of
conversational context in human perception or in automated detection models. We
dive deeper into both these directions. We start by analysing existing
contextual datasets and come to the conclusion that toxicity labelling by
humans is in general influenced by the conversational structure, polarity and
topic of the context. We then propose to bring these findings into
computational detection models by introducing and evaluating (a) neural
architectures for contextual toxicity detection that are aware of the
conversational structure, and (b) data augmentation strategies that can help
model contextual toxicity detection. Our results have shown the encouraging
potential of neural architectures that are aware of the conversation structure.
We have also demonstrated that such models can benefit from synthetic data,
especially in the social media domain.