Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution
Winograd schemas are a well-established tool for evaluating the coreference resolution (CoR) and commonsense reasoning (CSR) capabilities of computational models. So far, schemas have remained largely confined to English, limiting their utility in multilingual settings. This work presents Wino-X, a parallel dataset of German, French, and Russian schemas, aligned with their English counterparts. We use this resource to investigate whether neural machine translation (NMT) models can perform CoR that requires commonsense knowledge and whether multilingual language models (MLLMs) are capable of CSR across multiple languages. Our findings show Wino-X to be exceptionally challenging for NMT systems, which are prone to undesirable biases and unable to detect disambiguating information. We quantify these biases using established statistical methods and define ways to address both issues. We furthermore present evidence of active cross-lingual knowledge transfer in MLLMs, whereby fine-tuning models on English schemas yields CSR improvements in other languages.
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks
Recent efforts in natural language processing (NLP) commonsense reasoning
research have yielded a considerable number of new datasets and benchmarks.
However, most of these datasets formulate commonsense reasoning challenges in
artificial scenarios that are not reflective of the tasks which real-world NLP
systems are designed to solve. In this work, we present CRoW, a
manually-curated, multi-task benchmark that evaluates the ability of models to
apply commonsense reasoning in the context of six real-world NLP tasks. CRoW is
constructed using a multi-stage data collection pipeline that rewrites examples
from existing datasets using commonsense-violating perturbations. We use CRoW
to study how NLP systems perform across different dimensions of commonsense
knowledge, such as physical, temporal, and social reasoning. We find a
significant performance gap when NLP systems are evaluated on CRoW compared to
humans, showcasing that commonsense reasoning is far from being solved in
real-world task settings. We make our dataset and leaderboard available to the
research community at https://github.com/mismayil/crow.
Comment: 37 pages, camera-ready for EMNLP 202
Experience and Prediction: A Metric of Hardness for a Novel Litmus Test
In the last decade, the Winograd Schema Challenge (WSC) has become a central
aspect of the research community as a novel litmus test. Consequently, the WSC
has spurred research interest because it can be seen as the means to understand
human behavior. In this regard, the development of new techniques has made
possible the usage of Winograd schemas in various fields, such as the design of
novel forms of CAPTCHAs.
Work in the literature that established a baseline for human adult performance on the WSC has shown that not all schemas are alike, meaning that they could potentially be categorized according to their perceived hardness for humans. This hardness metric could then be used in future challenges, or in a WSC-based CAPTCHA service, to differentiate between Winograd schemas.
Our recent work has shown that this could be achieved by designing an automated system able to output the hardness indexes of Winograd schemas, albeit with limitations on the number of schemas it could be applied to. This paper builds on that research by presenting a new system, based on Machine Learning (ML), that outputs the hardness of any Winograd schema faster and more accurately than any previously used method. Our system, which implements two different approaches, namely a random forest and a deep learning (LSTM-based) model, is ready to be used as an extension of any system that aims to differentiate between Winograd schemas according to their perceived hardness for humans. Alongside the system itself, we extend previous work by presenting the results of a large-scale experiment that shows how human performance varies across Winograd schemas.
Comment: 33 pages, 10 figures
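The abstract above describes a random-forest approach to predicting a schema's hardness index. As a minimal sketch of that idea, the following regressor maps a few surface features of a schema to a hardness score; the feature set, training examples, and hardness labels here are invented for illustration and are not the paper's actual data or features.

```python
# Hypothetical sketch: a random-forest regressor mapping simple surface
# features of a Winograd schema to a human-perceived hardness index.
# Features, examples, and labels below are invented for illustration.
from sklearn.ensemble import RandomForestRegressor


def featurize(schema: str, candidates: list) -> list:
    """Toy features: word count, number of answer candidates, mean word length."""
    words = schema.split()
    return [
        float(len(words)),
        float(len(candidates)),
        sum(len(w) for w in words) / max(len(words), 1),
    ]


# Invented training triples: (schema text, answer candidates, hardness in [0, 1]).
train = [
    ("The trophy didn't fit in the suitcase because it was too big.",
     ["trophy", "suitcase"], 0.2),
    ("The council refused the demonstrators a permit because they feared violence.",
     ["council", "demonstrators"], 0.7),
    ("Joan thanked Susan for the help she had given.",
     ["Joan", "Susan"], 0.4),
]

X = [featurize(s, c) for s, c, _ in train]
y = [h for _, _, h in train]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Score an unseen schema; the output is a hardness index, not a resolution.
new_schema = ("The man couldn't lift his son because he was so weak.",
              ["man", "son"])
hardness = model.predict([featurize(*new_schema)])[0]
print(f"predicted hardness: {hardness:.2f}")
```

A deployed system would presumably use richer linguistic features (or the LSTM approach the abstract mentions), but the interface is the same: schema in, hardness index out.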
The Theory of Correlation Formulas and Their Application to Discourse Coherence
The Winograd Schema Challenge (WSC) was proposed as a measure of machine intelligence. It boils down to anaphora resolution, a task familiar from computational linguistics. Research in linguistics and AI has coalesced around discourse coherence as the critical factor in solving this task, and the process of establishing discourse coherence relies fundamentally on world and commonsense knowledge.
In this thesis, we build on an approach to establishing coherence on the basis of correlation. The utility of this approach lies in its conceptual clarity and its ability to flexibly represent commonsense knowledge. We work to fill some conceptual holes in the Correlation Calculus approach. First, understanding the calculus in a vacuum is not straightforward unless it has a precise semantics. Second, existing demonstrations of the Correlation Calculus on Winograd Schema Challenge problems have not been linguistically credible.
We hope to ameliorate some, but by no means all, of the outstanding issues with the Correlation Calculus. We do so first by providing a precise semantics for the calculus, which relates our intuitive understanding of correlation to a precise notion involving probabilities. Second, we formulate the establishment of discourse coherence by correlation formulas within the framework of Discourse Representation Theory. This provides a more complete and linguistically credible account of the relationship between the Correlation Calculus, discourse coherence, and Winograd Schema Challenge problems.
Interset: A natural language interface for teleoperated robotic assembly of the EASE space structure
A teleoperated robot was used to assemble the Experimental Assembly of Structures in Extra-vehicular activity (EASE) space structure under neutral buoyancy conditions, simulating a telerobot performing structural assembly in the zero gravity of space. This previous work used a manually controlled teleoperator as a test bed for system performance evaluations. From these results, several Artificial Intelligence options were proposed, one of which was further developed into a real-time assembly planner. The interface for this system is effective in assembling EASE structures using windowed graphics and a set of networked menus. As the problem space becomes more complex, and hence the set of control options grows, a natural language interface may prove beneficial as a supplement to the menu-based control strategy. Such an interface could be beneficial in situations such as describing the local environment, maintaining a database of task event histories, modifying a plan or a heuristic dynamically, summarizing a task in English, or operating in a novel situation.
Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation
We present a large-scale collection of diverse natural language inference
(NLI) datasets that help provide insight into how well a sentence
representation captures distinct types of reasoning. The collection results
from recasting 13 existing datasets from 7 semantic phenomena into a common NLI
structure, resulting in over half a million labeled context-hypothesis pairs in
total. We refer to our collection as the DNC: Diverse Natural Language
Inference Collection. The DNC is available online at https://www.decomp.net,
and will grow over time as additional resources are recast and added from novel
sources.
Comment: To be presented at EMNLP 2018. 15 page
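The recasting idea described above, converting labeled examples from existing datasets into a common NLI structure, can be sketched as follows. The `recast_pronoun` function and the example are illustrative assumptions, not the DNC's actual conversion code or data.

```python
# Hypothetical sketch of "recasting": turning a labeled pronoun-resolution
# example into the common NLI (context, hypothesis, label) format.
# The function and example are illustrative, not the DNC's actual pipeline.

def recast_pronoun(sentence, pronoun, referent, correct):
    """Recast a pronoun-resolution example as an NLI context-hypothesis pair."""
    # Substitute the candidate referent for the first pronoun occurrence.
    hypothesis = sentence.replace(pronoun, referent, 1)
    return {
        "context": sentence,
        "hypothesis": hypothesis,
        "label": "entailed" if correct else "not-entailed",
    }


pair = recast_pronoun(
    "The city council denied the marchers a permit because they feared violence.",
    "they", "the city council", correct=True)
print(pair["label"])
print(pair["hypothesis"])
```

Applying a recipe like this across many source datasets yields uniformly structured context-hypothesis pairs, which is what lets a single NLI model be probed on many distinct semantic phenomena.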