A survey on Recognizing Textual Entailment as an NLP Evaluation
Recognizing Textual Entailment (RTE) was proposed as a unified evaluation framework to compare the semantic understanding of different NLP systems. In this survey paper, we provide an overview of different approaches for evaluating and understanding the reasoning capabilities of NLP systems. We then focus our discussion on RTE by highlighting prominent RTE datasets as well as advances in RTE datasets that focus on specific linguistic phenomena and can be used to evaluate NLP systems at a fine-grained level. We conclude by arguing that, when evaluating NLP systems, the community should utilize newly introduced RTE datasets that focus on specific linguistic phenomena.
On the Evaluation of Semantic Phenomena in Neural Machine Translation Using Natural Language Inference
We propose a process for investigating the extent to which sentence representations arising from neural machine translation (NMT) systems encode distinct semantic phenomena. We use these representations as features to train a natural language inference (NLI) classifier based on datasets recast from existing semantic annotations. In applying this process to a representative NMT system, we find its encoder appears most suited to supporting inferences at the syntax-semantics interface, as compared to anaphora resolution requiring world knowledge. We conclude with a discussion on the merits and potential deficiencies of this process, and how it may be improved and extended as a broader framework for evaluating semantic coverage.
Comment: To be presented at NAACL 2018. 11 pages.
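A minimal sketch of this kind of probing setup, assuming a frozen sentence encoder and a handful of recast premise-hypothesis pairs; the encode() stub, the toy pairs, and the feature combination below are illustrative stand-ins, not the paper's implementation:

# Sketch: train an NLI classifier on top of frozen sentence representations.
# encode() is a placeholder for a real NMT encoder; swap in actual sentence vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(sentence):
    # placeholder: a deterministic pseudo-embedding keyed on the sentence text
    return np.random.default_rng(abs(hash(sentence)) % (2**32)).standard_normal(64)

def features(premise, hypothesis):
    p, h = encode(premise), encode(hypothesis)
    return np.concatenate([p, h, np.abs(p - h), p * h])  # a common NLI feature combination

toy_pairs = [
    ("A man is sleeping.", "A person is asleep.", "entailed"),
    ("A man is sleeping.", "A man is running.", "not-entailed"),
]
X = np.stack([features(p, h) for p, h, _ in toy_pairs])
y = [label for _, _, label in toy_pairs]
clf = LogisticRegression(max_iter=1000).fit(X, y)  # probe: what does the encoder expose?
print(clf.predict(X))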
Hypothesis Only Baselines in Natural Language Inference
We propose a hypothesis only baseline for diagnosing Natural Language
Inference (NLI). Especially when an NLI dataset assumes inference is occurring
based purely on the relationship between a context and a hypothesis, it follows
that assessing entailment relations while ignoring the provided context is a
degenerate solution. Yet, through experiments on ten distinct NLI datasets, we
find that this approach, which we refer to as a hypothesis-only model, is able
to significantly outperform a majority class baseline across a number of NLI
datasets. Our analysis suggests that statistical irregularities may allow a
model to perform NLI in some datasets beyond what should be achievable without
access to the context.Comment: Accepted at *SEM 2018 as long paper. 12 page
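A rough illustration of the idea, using a toy set of invented pairs and a simple bag-of-words classifier rather than a real NLI training set or model:

# Sketch: a hypothesis-only baseline vs. a majority-class baseline.
# The classifier never sees the context/premise, only the hypothesis text.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train = [("A dog runs.", "An animal moves.", "entailed"),
         ("A dog runs.", "A cat sleeps.", "not-entailed"),
         ("Two kids play.", "Children are playing.", "entailed"),
         ("Two kids play.", "Nobody is outside.", "not-entailed")]

hyps = [h for _, h, _ in train]
labels = [y for _, _, y in train]

vec = CountVectorizer().fit(hyps)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(hyps), labels)

majority = Counter(labels).most_common(1)[0][0]
print("majority-class baseline predicts:", majority)
print("hypothesis-only prediction:", clf.predict(vec.transform(["Children are playing."]))[0])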
Revisiting Recognizing Textual Entailment for Evaluating Natural Language Processing Systems
Recognizing Textual Entailment (RTE) began as a unified framework to evaluate the reasoning capabilities of Natural Language Processing (NLP) models. In recent years, RTE has evolved in the NLP community into a task that researchers focus on developing models for. This thesis revisits the tradition of RTE as an evaluation framework for NLP models, especially in the era of deep learning.
Chapter 2 provides an overview of different approaches to evaluating NLP systems, discusses prior RTE datasets, and argues why many of them do not serve as satisfactory tests to evaluate the reasoning capabilities of NLP systems. Chapter 3 presents a new large-scale diverse collection of RTE datasets (DNC) that tests how well NLP systems capture a range of semantic phenomena that are integral to understanding human language. Chapter 4 demonstrates how the DNC can be used to evaluate reasoning capabilities of NLP models. Chapter 5 discusses the limits of RTE as an evaluation framework by illuminating how existing datasets contain biases that may enable crude modeling approaches to perform surprisingly well.
The remaining chapters of the thesis focus on issues raised in Chapter 5. Chapter 6 addresses issues in prior RTE datasets focused on paraphrasing and presents a high-quality test set that can be used to analyze how robust RTE systems are to paraphrases. Chapter 7 demonstrates how modeling approaches that target biases, e.g. adversarial learning, can enable RTE models to overcome the biases discussed in Chapter 5. Chapter 8 applies these methods to the task of discovering emergency needs during disaster events.
Probing Neural Language Models for Human Tacit Assumptions
Humans carry stereotypic tacit assumptions (STAs) (Prince, 1978), or propositional beliefs about generic concepts. Such associations are crucial for understanding natural language. We construct a diagnostic set of word prediction prompts to evaluate whether recent neural contextualized language models trained on large text corpora capture STAs. Our prompts are based on human responses in a psychological study of conceptual associations. We find models to be profoundly effective at retrieving concepts given associated properties. Our results demonstrate empirical evidence that stereotypic conceptual representations are captured in neural models derived from semi-supervised linguistic exposure.
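A minimal probe in this spirit, using a masked language model through the Hugging Face fill-mask pipeline; the prompt wording below is an invented example, not one of the paper's items:

# Sketch: does a masked LM recover a concept from its associated properties?
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
prompt = "A [MASK] has fur, is big, and has claws."   # property-based cue for a concept
for pred in fill(prompt, top_k=5):
    print(f"{pred['token_str']:>10s}  {pred['score']:.3f}")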
Quantum-chemical study of C-H bond dissociation enthalpies of various small non-aromatic organic molecules
In this work, C-H bond dissociation enthalpies (BDEs) and vertical ionization potentials (IPs) for various hydrocarbons and ketones were calculated using four density functional approaches. Calculated BDEs and IPs were correlated with experimental data. The linearity of the corresponding dependences can be considered very good. Comparing the two functionals used, B3LYP C-H BDE values are closer to experimental results than PBE0 values for both basis sets. The 6-31G* basis set, employed with both functionals, gives C-H BDEs closer to the experimental values than the 6-311++G** basis set. Using the obtained linear dependences BDE_exp = f(BDE_calc), the experimental values of C-H BDEs for some structurally related compounds can be estimated solely from calculations. As descriptors of the C-H BDE, the IPs and 13C NMR chemical shifts have been investigated using data obtained from the B3LYP/6-31G* calculations. There is a slight indication of a linear correlation between IPs and C-H BDEs in the sets of simple alkanes and alkenes/cycloalkenes. However, for cycloalkanes and aliphatic carbonyl compounds, no linear correlation was found. In the case of the 13C NMR chemical shifts, a correlation with C-H BDEs can be found for the sets of alkanes and cycloalkanes, but for the other studied molecules, no trends were detected.
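The estimation step described above amounts to a simple linear fit of the form BDE_exp = a*BDE_calc + b; a sketch with placeholder numbers (not values from the study):

# Sketch: estimate experimental C-H BDEs from calculated ones via a linear fit.
# The arrays below are illustrative placeholders, not data from the paper.
from scipy.stats import linregress

bde_calc = [95.0, 98.5, 101.2, 104.8]   # kcal/mol, calculated (e.g. B3LYP/6-31G*)
bde_exp  = [96.1, 99.0, 101.9, 105.2]   # kcal/mol, experimental reference values

fit = linregress(bde_calc, bde_exp)
print(f"BDE_exp = {fit.slope:.3f} * BDE_calc + {fit.intercept:.2f}  (r = {fit.rvalue:.3f})")
print("estimated BDE_exp for BDE_calc = 100.0:",
      round(fit.slope * 100.0 + fit.intercept, 1), "kcal/mol")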
Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation
We present a large-scale collection of diverse natural language inference (NLI) datasets that help provide insight into how well a sentence representation captures distinct types of reasoning. The collection results from recasting 13 existing datasets from 7 semantic phenomena into a common NLI structure, resulting in over half a million labeled context-hypothesis pairs in total. We refer to our collection as the DNC: Diverse Natural Language Inference Collection. The DNC is available online at https://www.decomp.net, and will grow over time as additional resources are recast and added from novel sources.
Comment: To be presented at EMNLP 2018. 15 pages.
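The recasting step can be pictured as mapping an existing semantic annotation into a shared context-hypothesis-label record; a schematic example with invented sentences and illustrative field names, not an actual DNC item or its exact schema:

# Sketch: one recast NLI example in a DNC-style structure.
recast_example = {
    "context": "The glass shattered when it hit the floor.",
    "hypothesis": "The glass broke.",
    "label": "entailed",
    "source-dataset": "some-semantic-annotation",   # which resource the pair was recast from
    "phenomenon": "event factuality",               # the semantic phenomenon being tested
}
print(recast_example["context"], "=>", recast_example["hypothesis"], ":", recast_example["label"])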
Evaluating Paraphrastic Robustness in Textual Entailment Models
We present PaRTE, a collection of 1,126 pairs of Recognizing Textual Entailment (RTE) examples to evaluate whether models are robust to paraphrasing. We posit that if RTE models understand language, their predictions should be consistent across inputs that share the same meaning. We use the evaluation set to determine if RTE models' predictions change when examples are paraphrased. In our experiments, contemporary models change their predictions on 8-16% of paraphrased examples, indicating that there is still room for improvement.
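The evaluation reduces to checking how often a model's label flips once an example is paraphrased; a small sketch, assuming a hypothetical predict() function and invented example pairs rather than a real RTE model or PaRTE data:

# Sketch: measure how often predictions change under paraphrase.
# predict() stands in for a real RTE model; the examples are invented.
def predict(premise, hypothesis):
    return "entailed" if hypothesis.lower() in premise.lower() else "not-entailed"

examples = [
    # (original premise, hypothesis, paraphrased premise)
    ("the cat sat on the mat", "the cat sat", "a cat was sitting on the mat"),
    ("she bought a red car", "she bought a car", "she purchased a red automobile"),
]

flips = sum(predict(p, h) != predict(p2, h) for p, h, p2 in examples)
print(f"predictions changed on {flips}/{len(examples)} paraphrased examples")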