Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by their mechanism to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.
Comment: 25 pages
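Below is a minimal PyTorch sketch of the common fusion baseline described above: a CNN encodes the image, an RNN encodes the question, and the two representations are combined in a common feature space before answer classification. The class name, dimensions, and fusion by element-wise product are illustrative assumptions, not the design of any specific surveyed system.

```python
# Minimal sketch of the CNN + RNN fusion baseline for VQA (illustrative only).
import torch
import torch.nn as nn
from torchvision import models

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # CNN encoder: a ResNet with the classifier head removed.
        resnet = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # -> (B, 512, 1, 1)
        self.img_proj = nn.Linear(512, hidden_dim)
        # RNN encoder for the question tokens.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Answer classifier over the fused representation.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_ids):
        img_feat = self.cnn(image).flatten(1)           # (B, 512)
        img_feat = torch.tanh(self.img_proj(img_feat))  # (B, hidden)
        _, (h_n, _) = self.rnn(self.embed(question_ids))
        q_feat = h_n[-1]                                # (B, hidden)
        fused = img_feat * q_feat                       # common feature space
        return self.classifier(fused)                   # answer logits
```

Element-wise product is only one of several fusion operators studied in the literature; concatenation and bilinear pooling are common alternatives, and most recent systems attend over image regions rather than using a single global feature.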
When Do Discourse Markers Affect Computational Sentence Understanding?
The capabilities and use cases of automatic natural language processing (NLP)
have grown significantly over the last few years. While much work has been
devoted to understanding how humans deal with discourse connectives, this
phenomenon is understudied in computational systems. Therefore, it is important
to put NLP models under the microscope and examine whether they can adequately
comprehend, process, and reason within the complexity of natural language. In
this chapter, we introduce the main mechanisms behind automatic sentence
processing systems step by step and then focus on evaluating discourse
connective processing. We assess nine popular systems in their ability to
understand English discourse connectives and analyze how context and language
understanding tasks affect their connective comprehension. The results show
that NLP systems do not process all discourse connectives equally well and that
the computational processing complexity of different connective kinds does not
consistently align with the complexity order presumed from human processing. In
addition, while humans tend to be influenced by connectives during reading but
not necessarily in their final comprehension, discourse connectives have a
significant impact on the final accuracy of NLP systems. The richer a system's
knowledge of connectives, the more strongly inappropriate connectives degrade
its performance. This suggests that the correct explicitation of discourse
connectives is important for computational natural language processing.
Comment: Chapter 7 of Discourse Markers in Interaction, published in Trends in
Linguistics. Studies and Monographs
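As a rough illustration of the kind of probing such an evaluation involves, the sketch below asks a masked language model which connective it prefers between two clauses. The checkpoint, sentence, and candidate connectives are illustrative assumptions, not the nine systems or the materials used in the chapter.

```python
# Hedged sketch: probe a masked language model's preference among connectives.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

# Two clauses joined by a masked connective; compare the scores the model
# assigns to a contrastive vs. a causal candidate.
text = "He studied hard for the exam. <mask>, he failed it."
candidates = [" However", " Therefore"]

for result in fill(text, targets=candidates):
    print(result["token_str"], round(result["score"], 4))
```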
Revisiting Recognizing Textual Entailment for Evaluating Natural Language Processing Systems
Recognizing Textual Entailment (RTE) began as a unified framework to evaluate the reasoning capabilities of Natural Language Processing (NLP) models. In recent years, RTE has evolved in the NLP community into a task that researchers focus on developing models for. This thesis revisits the tradition of RTE as an evaluation framework for NLP models, especially in the era of deep learning.
Chapter 2 provides an overview of different approaches to evaluating NLP systems, discusses prior RTE datasets, and argues why many of them do not serve as satisfactory tests to evaluate the reasoning capabilities of NLP systems. Chapter 3 presents a new large-scale diverse collection of RTE datasets (DNC) that tests how well NLP systems capture a range of semantic phenomena that are integral to understanding human language. Chapter 4 demonstrates how the DNC can be used to evaluate reasoning capabilities of NLP models. Chapter 5 discusses the limits of RTE as an evaluation framework by illuminating how existing datasets contain biases that may enable crude modeling approaches to perform surprisingly well.
The remaining aspects of the thesis focus on issues raised in Chapter 5. Chapter 6 addresses issues in prior RTE datasets focused on paraphrasing and presents a high-quality test set that can be used to analyze how robust RTE systems are to paraphrases. Chapter 7 demonstrates how modeling approaches that target biases, e.g. adversarial learning, can enable RTE models to overcome the biases discussed in Chapter 5. Chapter 8 applies these methods to the task of discovering emergency needs during disaster events.
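A minimal sketch of RTE/NLI used as an evaluation probe, in the spirit of the thesis: feed premise-hypothesis pairs targeting a phenomenon to a pretrained NLI model and score its predictions. The checkpoint, probe pairs, and expected labels are illustrative assumptions and are not drawn from the DNC.

```python
# Hedged sketch: score a pretrained NLI model on hand-written probe pairs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def predict(premise: str, hypothesis: str) -> str:
    inputs = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))]

probes = [
    # Each probe targets a phenomenon: paraphrase-style entailment and a
    # straightforward contradiction (examples are illustrative).
    ("A man is playing a guitar on stage.", "A person is performing music.", "ENTAILMENT"),
    ("A man is playing a guitar on stage.", "The stage is empty.", "CONTRADICTION"),
]

correct = sum(predict(p, h) == gold for p, h, gold in probes)
print(f"probe accuracy: {correct / len(probes):.2f}")
```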
State-of-the-Art: Assessing Semantic Similarity in Automated Short-Answer Grading Systems
The use of semantics in Natural Language Processing (NLP) has sparked the interest of academics and businesses in various fields. One such field is Automated Short-answer Grading Systems (ASAGS), which automatically evaluate responses for similarity with the expected answer. ASAGS poses semantic challenges because responses to a topic are phrased in the responder's own words. This study provides an in-depth analysis of work to improve the assessment of semantic similarity between corpora in natural language in the context of ASAGS. Three popular semantic approaches, corpus-based, knowledge-based, and deep learning, are evaluated against the conventional methods in ASAGS. Finally, the gaps in knowledge are identified and new research areas are proposed.
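As a sketch of the deep-learning flavour of semantic similarity discussed above, the snippet below embeds a reference answer and a student response and compares them with cosine similarity. The model name, threshold, and example answers are assumptions for illustration only; a real grader would calibrate against human-graded data.

```python
# Hedged sketch: embedding-based similarity between reference and response.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Photosynthesis converts light energy into chemical energy stored in glucose."
response = "Plants use sunlight to make sugar, storing the energy chemically."

emb = model.encode([reference, response], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()

# Map similarity to a grade with an assumed threshold (illustrative only).
grade = "correct" if similarity > 0.7 else "needs review"
print(f"similarity={similarity:.2f} -> {grade}")
```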
Evaluating Conversational Recommender Systems: A Landscape of Research
Conversational recommender systems aim to interactively support online users
in their information search and decision-making processes in an intuitive way.
With the latest advances in voice-controlled devices, natural language
processing, and AI in general, such systems have received increased attention in
recent years. Technically, conversational recommenders are usually complex
multi-component applications and often consist of multiple machine learning
models and a natural language user interface. Evaluating such a complex system
in a holistic way can therefore be challenging, as it requires (i) the
assessment of the quality of the different learning components, and (ii) the
quality perception of the system as a whole by users. Thus, a mixed methods
approach is often required, which may combine objective (computational) and
subjective (perception-oriented) evaluation techniques. In this paper, we
review common evaluation approaches for conversational recommender systems,
identify possible limitations, and outline future directions towards more
holistic evaluation practices.
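To illustrate the objective (computational) side of such a mixed-methods evaluation, the sketch below computes simple task-oriented metrics from logged conversations. The log format and field names are assumptions; the subjective, perception-oriented side would still require user studies.

```python
# Hedged sketch: offline metrics over logged dialogues (format is assumed).
from dataclasses import dataclass
from statistics import mean

@dataclass
class DialogueLog:
    turns: int
    accepted_recommendation: bool  # did the user accept a recommended item?

logs = [
    DialogueLog(turns=6, accepted_recommendation=True),
    DialogueLog(turns=9, accepted_recommendation=False),
    DialogueLog(turns=4, accepted_recommendation=True),
]

success_rate = mean(d.accepted_recommendation for d in logs)
avg_turns_success = mean(d.turns for d in logs if d.accepted_recommendation)

print(f"success rate: {success_rate:.2f}")
print(f"avg turns (successful dialogues): {avg_turns_success:.1f}")
```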
A Crowdsourced Frame Disambiguation Corpus with Ambiguity
We present a resource for the task of FrameNet semantic frame disambiguation
of over 5,000 word-sentence pairs from the Wikipedia corpus. The annotations
were collected using a novel crowdsourcing approach with multiple workers per
sentence to capture inter-annotator disagreement. In contrast to the typical
approach of attributing the best single frame to each word, we provide a list
of frames with disagreement-based scores that express the confidence with which
each frame applies to the word. This is based on the idea that inter-annotator
disagreement is at least partly caused by ambiguity that is inherent to the
text and frames. We have found many examples where the semantics of individual
frames overlap sufficiently to make them acceptable alternatives for
interpreting a sentence. We have argued that ignoring this ambiguity creates an
overly arbitrary target for training and evaluating natural language processing
systems - if humans cannot agree, why would we expect the correct answer from a
machine to be any different? To process this data we also utilized an expanded
lemma-set provided by the Framester system, which merges FN with WordNet to
enhance coverage. Our dataset includes annotations of 1,000 sentence-word pairs
whose lemmas are not part of FN. Finally, we present metrics for evaluating
frame disambiguation systems that account for ambiguity.
Comment: Accepted to NAACL-HLT 2019
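A minimal sketch of the disagreement-based scoring idea: rather than a single best frame, each candidate frame receives a score proportional to how many workers chose it. The worker choices below are illustrative, not entries from the released corpus, and the resource's own scores are based on a more elaborate disagreement analysis.

```python
# Hedged sketch: per-frame scores from worker disagreement for one word-sentence pair.
from collections import Counter

# Each entry is one crowd worker's frame choice for the same target word.
worker_choices = ["Motion", "Motion", "Self_motion", "Motion", "Self_motion",
                  "Travel", "Motion", "Self_motion", "Motion", "Motion"]

counts = Counter(worker_choices)
total = len(worker_choices)
frame_scores = {frame: n / total for frame, n in counts.items()}

for frame, score in sorted(frame_scores.items(), key=lambda kv: -kv[1]):
    print(f"{frame}: {score:.2f}")
```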
Introducing a Gold Standard Corpus from Young Multilinguals for the Evaluation of Automatic UD-PoS Taggers for Italian
Part-of-speech (PoS) tagging constitutes a common task in Natural Language Processing (NLP), given its widespread applicability. However, with the advance of new information technologies and language variation, the contents and methods of PoS tagging have changed. The majority of existing Italian data for this task originate from standard texts, where language use is far from multifaceted informal real-life situations. Automatic PoS-tagging models trained with such data do not perform reliably on non-standard language, like social media content or language learners' texts. Our aim is to provide additional training and evaluation data from language learners tagged in Universal Dependencies (UD), as well as to test current automatic PoS-tagging systems and evaluate their performance on such data. We use a multilingual corpus of young language learners, LEONIDE, to create a tagged gold standard for evaluating UD PoS-tagging performance on non-standard Italian. With the 3.7 version of Stanza, a Python NLP package, we apply the available automatic PoS-taggers, namely ISDT, ParTUT, POSTWITA, TWITTIRÒ and VIT, trained with both standard and non-standard data, on our dataset. Our results show that the above taggers, trained on non-standard data or multilingual treebanks, can achieve up to 95% accuracy on multilingual learner data, if combined.
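As a sketch of the evaluation setup, the snippet below runs Stanza's Italian pipeline over a sentence and compares predicted UPOS tags against gold tags. The example sentence and gold tags are illustrative assumptions; the study itself evaluates several Italian treebank models on the LEONIDE-based gold standard.

```python
# Hedged sketch: tag an Italian sentence with Stanza and score against gold UPOS.
import stanza

stanza.download("it")  # fetch Italian models once
nlp = stanza.Pipeline("it", processors="tokenize,mwt,pos")

sentence = "Mi piace molto la pizza"
gold_upos = ["PRON", "VERB", "ADV", "DET", "NOUN"]  # assumed gold annotation

doc = nlp(sentence)
pred_upos = [word.upos for sent in doc.sentences for word in sent.words]

correct = sum(p == g for p, g in zip(pred_upos, gold_upos))
print(f"accuracy: {correct / len(gold_upos):.2f}")
```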