Can Knowledge Rich Sentences Help Language Models To Solve Common Sense Reasoning Problems?
abstract: The significance of real-world knowledge for Natural Language Understanding (NLU) has been recognized for decades. With advances in technology, challenging tasks like question answering, text summarization, and machine translation have become possible through continuous efforts in the field of Natural Language Processing (NLP). Yet integrating knowledge to answer commonsense questions remains a daunting task. Logical reasoning has been a resort for many problems in NLP and has achieved considerable results in the field, but it is difficult to resolve the ambiguities of natural language. Coreference resolution is one problem where ambiguity arises from the semantics of the sentence. Another is cause-and-effect statements, which require causal commonsense reasoning to resolve the ambiguity. Modeling these types of problems with rules or logic is not simple. State-of-the-art systems addressing these problems use trained neural network models, which are claimed to embed broad knowledge from a huge training corpus. These systems answer questions by using the knowledge embedded in their trained language model. Although language models embed knowledge from their data, they rely on word occurrences and the frequency of co-occurring words to resolve the prevailing ambiguity. This limits their performance on commonsense reasoning tasks, as they generalize the concept rather than answering the problem in its specific context. For example, "The painting in Mark's living room shows an oak tree. It is to the right of a house" is a coreference resolution problem that requires knowledge. Language models can resolve whether "it" refers to "painting" or "tree": since "house" and "tree" are commonly co-occurring words, the models can resolve "tree" to be the coreferent. On the other hand, "The large ball crashed right through the table. Because it was made of Styrofoam.", where "it" can refer to either "table" or "ball", is difficult for a language model because it requires more information about the problem.
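The co-occurrence heuristic described above can be illustrated with a minimal sketch (a toy corpus and invented sentences for illustration, not the thesis's actual system): candidate antecedents for a pronoun are scored simply by how often they co-occur with a context word.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often word pairs appear together in the same sentence."""
    counts = Counter()
    for s in sentences:
        words = set(s.lower().split())
        for a, b in combinations(sorted(words), 2):
            counts[(a, b)] += 1
    return counts

def resolve_pronoun(candidates, context_word, counts):
    """Pick the candidate that co-occurs most often with the context word.

    This mirrors the statistical-association behavior the abstract
    criticizes: no reasoning about the situation, only word statistics.
    """
    def score(cand):
        pair = tuple(sorted((cand.lower(), context_word.lower())))
        return counts[pair]
    return max(candidates, key=score)

# toy corpus where "house" and "tree" frequently co-occur
corpus = [
    "the house stood beside an old tree",
    "a tree grew near the house",
    "the painting hung on the wall",
]
counts = cooccurrence_counts(corpus)
print(resolve_pronoun(["painting", "tree"], "house", counts))  # -> tree
```

Such a scorer succeeds on the painting/tree example for the wrong reason and has no way to handle the ball/table case, where the decisive fact (Styrofoam is fragile) is not a co-occurrence statistic.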
In this work, I have built an end-to-end framework that uses knowledge automatically extracted based on the problem. This knowledge is combined with language models through an explicit reasoning module to resolve the ambiguity. The system is built to improve the accuracy of language-model-based approaches to commonsense reasoning, and it achieves state-of-the-art accuracy on the Winograd Schema Challenge.
Dissertation/Thesis: Masters Thesis, Computer Science, 201
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011),
a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun
resolution problems originally designed to be unsolvable for statistical models
that rely on selectional preferences or word associations. However, recent
advances in neural language models have already reached around 90% accuracy on
variants of WSC. This raises an important question of whether these models have
truly acquired robust commonsense capabilities or whether they rely on spurious
biases in the datasets that lead to an overestimation of the true capabilities
of machine commonsense. To investigate this question, we introduce WinoGrande,
a large-scale dataset of 44k problems, inspired by the original WSC design, but
adjusted to improve both the scale and the hardness of the dataset. The key
steps of the dataset construction consist of (1) a carefully designed
crowdsourcing procedure, followed by (2) systematic bias reduction using a
novel AfLite algorithm that generalizes human-detectable word associations to
machine-detectable embedding associations. The best state-of-the-art methods on
WinoGrande achieve 59.4-79.1% accuracy, 15-35 points below human performance of
94.0%, depending on the amount of training data allowed. Furthermore, we
establish new state-of-the-art results on five related benchmarks - WSC
(90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%).
These results have dual implications: on one hand, they demonstrate the
effectiveness of WinoGrande when used as a resource for transfer learning. On
the other hand, they raise a concern that we are likely to be overestimating
the true capabilities of machine commonsense across all these benchmarks. We
emphasize the importance of algorithmic bias reduction in existing and future
benchmarks to mitigate such overestimation.
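The AfLite idea described above can be sketched in simplified form (illustrative only, not the paper's exact algorithm; the weak probe, round count, and cutoff below are assumptions): repeatedly fit a cheap classifier on random subsets of precomputed embeddings, track how often each held-out instance is classified correctly, and flag instances that are "too predictable" as carrying dataset bias.

```python
import numpy as np

def aflite_sketch(X, y, n_rounds=20, train_frac=0.8, cutoff=0.9, rng=None):
    """Simplified AfLite-style adversarial filtering over embeddings X.

    Returns (keep, predictability): keep[i] is False when instance i was
    classified correctly too often while held out, i.e. it is solvable
    from embedding associations alone.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    correct = np.zeros(n)
    seen = np.zeros(n)
    for _ in range(n_rounds):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        # weak probe: nearest class centroid in embedding space
        centroids = {c: X[tr][y[tr] == c].mean(axis=0) for c in np.unique(y[tr])}
        classes = sorted(centroids)
        C = np.stack([centroids[c] for c in classes])
        pred = [classes[np.argmin(((C - x) ** 2).sum(axis=1))] for x in X[te]]
        correct[te] += (np.array(pred) == y[te])
        seen[te] += 1
    predictability = np.divide(correct, seen, out=np.zeros(n), where=seen > 0)
    keep = predictability < cutoff
    return keep, predictability

# synthetic demo: two well-separated clusters are trivially predictable,
# so nearly every instance gets filtered out
gen = np.random.default_rng(1)
X = np.vstack([gen.normal(10.0, 1.0, (40, 2)), gen.normal(-10.0, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
keep, predictability = aflite_sketch(X, y, rng=0)
```

The real AfLite operates on contextual embeddings of WinoGrande instances with linear probes; the nearest-centroid probe here just keeps the sketch self-contained.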
Experience and Prediction: A Metric of Hardness for a Novel Litmus Test
In the last decade, the Winograd Schema Challenge (WSC) has become a central
focus of the research community as a novel litmus test. The WSC has spurred
research interest because it can be seen as a means of understanding human
behavior. In this regard, the development of new techniques has made it
possible to use Winograd schemas in various fields, such as the design of
novel forms of CAPTCHAs.
Work from the literature that established a baseline for human adult
performance on the WSC has shown that not all schemas are the same, meaning
that they could potentially be categorized according to their perceived
hardness for humans. In this regard, this hardness-metric could be
used in future challenges or in the WSC CAPTCHA service to differentiate
between Winograd schemas.
Recent work of ours has shown that this could be achieved via the design of
an automated system that is able to output the hardness-indexes of Winograd
schemas, albeit with limitations regarding the number of schemas it could be
applied on. This paper adds to previous research by presenting a new system,
based on Machine Learning (ML), that outputs the hardness of any Winograd
schema faster and more accurately than any previously used method. Our system,
which implements two approaches, a random forest and a deep learning
(LSTM-based) model, is ready to be used as an extension of any other system
that aims to differentiate between Winograd schemas according to their
perceived hardness for humans. At the same time, along with our developed
system, we extend previous work by presenting the results of a large-scale
experiment that shows how human performance varies across Winograd schemas.
Comment: 33 pages, 10 figures
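The random-forest approach to hardness prediction could be sketched as a regression over surface features of a schema (the features and hardness values below are invented for illustration; the paper's actual feature set and training targets are not given here).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def schema_features(schema):
    """Hypothetical surface features for a Winograd schema."""
    words = schema.split()
    return [
        len(words),                               # schema length
        sum(len(w) for w in words) / len(words),  # mean word length
        schema.count(","),                        # clause-complexity proxy
    ]

# toy training set: schemas paired with invented hardness scores in [0, 1]
# (0 = easy for humans, 1 = hard); real targets would come from
# large-scale human-performance data
schemas = [
    "The trophy does not fit in the suitcase because it is too big.",
    "The councilmen refused the demonstrators a permit because they feared violence.",
    "The ball crashed through the table because it was made of Styrofoam.",
    "Joan thanked Susan for the help she had given.",
]
hardness = np.array([0.2, 0.6, 0.4, 0.3])

X = np.array([schema_features(s) for s in schemas])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, hardness)
new_schema = "The truck zoomed by the school bus because it was going so fast."
pred = model.predict([schema_features(new_schema)])
```

A hardness index like this could then gate which schemas a WSC CAPTCHA service serves to users.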
Towards Understanding Natural Language: Semantic Parsing, Commonsense Knowledge Acquisition, Reasoning Framework and Applications
abstract: Reasoning with commonsense knowledge is an integral component of human behavior. It is due to this capability that people know that a weak person may not be able to lift someone. It has been a long-standing goal of the Artificial Intelligence community to simulate such commonsense reasoning abilities in machines. Over the years, many advances have been made and various challenges have been proposed to test these abilities. The Winograd Schema Challenge (WSC) is one such Natural Language Understanding (NLU) task, which was also proposed as an alternative to the Turing Test. It is made up of textual question-answering problems which require resolution of a pronoun to its correct antecedent.
In this thesis, two approaches to developing NLU systems to solve the Winograd Schema Challenge are demonstrated. To this end, a semantic parser is presented, various kinds of commonsense knowledge are identified, techniques to extract commonsense knowledge are developed, and two commonsense reasoning algorithms are presented. The usefulness of the developed tools and techniques is shown by applying them to solve the challenge.
Dissertation/Thesis: Doctoral Dissertation, Computer Science, 201
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
Commonsense AI has long been seen as a near-impossible goal -- until
recently. Now, research interest has sharply increased with an influx of new
benchmarks and models.
We propose two new ways to evaluate commonsense models, emphasizing their
generality on new tasks and building on diverse, recently introduced
benchmarks. First, we propose a new multitask benchmark, RAINBOW, to promote
research on commonsense models that generalize well over multiple tasks and
datasets. Second, we propose a novel evaluation, the cost equivalent curve,
that sheds new insight on how the choice of source datasets, pretrained
language models, and transfer learning methods impacts performance and data
efficiency.
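One way to read the cost equivalent curve (a sketch of our understanding of the idea, with invented learning-curve numbers): for each training budget on the no-transfer baseline, find how many target-task examples the transfer-learning curve needs to reach the same performance.

```python
import numpy as np

def cost_equivalent_curve(baseline_x, baseline_y, transfer_x, transfer_y):
    """For each baseline performance level, return the number of target
    examples at which the (monotonic) transfer curve reaches that level,
    via linear interpolation on the inverted curve."""
    return np.array([np.interp(p, transfer_y, transfer_x) for p in baseline_y])

# invented learning curves: accuracy vs. number of target-task examples
baseline_x = np.array([100, 1000, 10000])
baseline_y = np.array([0.60, 0.70, 0.80])   # training from scratch
transfer_x = np.array([100, 1000, 10000])
transfer_y = np.array([0.70, 0.80, 0.85])   # with transfer learning
equivalent = cost_equivalent_curve(baseline_x, baseline_y, transfer_x, transfer_y)
```

When transfer helps, the equivalent budget sits below the baseline budget at every point, which is how the curve exposes data-efficiency gains rather than just a single headline accuracy.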
We perform extensive experiments -- over 200 experiments encompassing 4800
models -- and report multiple valuable and sometimes surprising findings, e.g.,
that transfer almost always leads to better or equivalent performance if
following a particular recipe, that QA-based commonsense datasets transfer well
with each other, while commonsense knowledge graphs do not, and that perhaps
counter-intuitively, larger models benefit more from transfer than smaller
ones.
Last but not least, we introduce a new universal commonsense reasoning model,
UNICORN, that establishes new state-of-the-art performance across 8 popular
commonsense benchmarks, aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA
(90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%) and CommonsenseQA
(79.3%).
Comment: 27 pages, 19 figures, 34 tables. Accepted to AAAI 2021. For
associated code and data see https://github.com/allenai/rainbo