4 research outputs found

    TACAM: Topic And Context Aware Argument Mining

    In this work we address the problem of argument search. The purpose of argument search is the distillation of pro and contra arguments for requested topics from large text corpora. In previous works, the usual approach is to use a standard search engine to extract text parts that are relevant to the given topic and subsequently use an argument recognition algorithm to select arguments from them. The main challenge in the argument recognition task, also known as argument mining, is that sentences containing arguments are often structurally similar to purely informative sentences without any stance on the topic. In fact, they differ only semantically. Most approaches use topic or search term information only for the first search step and therefore assume that arguments can be classified independently of a topic. We argue that topic information is crucial for argument mining, since the topic defines the semantic context of an argument. Specifically, we propose different models for the classification of arguments that take information about the topic of an argument into account. Moreover, to enrich the context of a topic and to let models better understand the context of the potential argument, we integrate information from different external sources such as Knowledge Graphs or pre-trained NLP models. Our evaluation shows that considering topic information, especially in connection with external information, provides a significant performance boost for the argument mining task.

    Learning from disagreement: a survey

    Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. All datasets and models employed in this paper are freely available as supplementary materials.
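The contrast between training on aggregated labels and training directly on soft labels can be illustrated with a minimal sketch. It is not the survey's implementation: the function names, the three-class example, and the five judgments are hypothetical, and the loss is plain cross-entropy computed with the Python standard library.

```python
import math
from collections import Counter

def soft_label(judgments, classes):
    """Turn raw crowd judgments into a probability distribution over classes."""
    counts = Counter(judgments)
    return [counts[c] / len(judgments) for c in classes]

def majority_label(judgments, classes):
    """Aggregate crowd judgments into a single label by majority vote."""
    counts = Counter(judgments)
    return max(classes, key=lambda c: counts[c])

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

# Hypothetical item with five annotators who disagree.
classes = ["entailment", "neutral", "contradiction"]
judgments = ["entailment", "entailment", "neutral", "entailment", "neutral"]

soft = soft_label(judgments, classes)        # distribution over the three classes
hard = majority_label(judgments, classes)    # single aggregated label
one_hot = [1.0 if c == hard else 0.0 for c in classes]

# A model output that mirrors the crowd distribution.
predicted = [0.6, 0.4, 0.0]

soft_loss = cross_entropy(soft, predicted)     # target keeps the disagreement
hard_loss = cross_entropy(one_hot, predicted)  # target discards the disagreement
```

With the soft target, the loss is minimized when the model reproduces the annotator distribution; with the one-hot target, the loss keeps decreasing as the model puts more mass on the majority class, so the minority judgments carry no training signal.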

    Making the Most of Crowd Information: Learning and Evaluation in AI tasks with Disagreements.

    PhD Theses
    There is plenty of evidence that humans disagree on the interpretation of many tasks in Natural Language Processing (NLP) and Computer Vision (CV), from objective tasks rooted in linguistics such as part-of-speech tagging to more subjective (observer-dependent) tasks such as classifying an image or deciding whether a proposition follows from a certain premise. While most learning in Artificial Intelligence (AI) still relies on the assumption that a single interpretation, captured by the gold label, exists for each item, a growing body of research in recent years has focused on learning methods that do not rely on this assumption. Rather, they aim to learn ranges of truth amidst disagreement. This PhD research makes a contribution to this field of study. Firstly, we analytically review the evidence for disagreement on NLP and CV tasks, focusing on tasks where substantial datasets with such information have been created. As part of this review, we also discuss the most popular approaches to training models from datasets containing multiple judgments and group these methods according to their handling of disagreement. Secondly, we make three proposals for learning with disagreement: soft-loss, multi-task learning from gold and crowds, and automatic temperature-scaled soft-loss. Thirdly, we address one gap in this field of study, the prevalence of hard metrics for model evaluation even when the gold assumption is shown to be an idealization, by proposing several previously existing metrics and novel soft metrics that do not make this assumption and by analyzing the merits and assumptions of all the metrics, hard and soft. Finally, we carry out a systematic investigation of the key proposals in learning with disagreement by training them across several tasks, considering several ways to evaluate the resulting models and assessing the conditions under which each approach is effective. This is a key contribution of this research, as research in learning with disagreement does not often test proposals across tasks, compare proposals with a variety of approaches, or evaluate using both soft metrics and hard metrics. The results obtained suggest, first of all, that it is essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. All datasets and models employed in this paper are freely available as supplementary materials.
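The difference between hard and soft evaluation can be made concrete with a small sketch. The abstract does not name the specific metrics, so the choices below are assumptions for illustration only: accuracy against a gold label stands in for a hard metric, and Jensen-Shannon divergence between the model output and the annotator distribution stands in for a soft metric. The data and the two models are hypothetical.

```python
import math

def accuracy(gold, predicted_dists, classes):
    """Hard metric: fraction of items whose argmax prediction equals the gold label."""
    hits = 0
    for g, dist in zip(gold, predicted_dists):
        best = max(range(len(classes)), key=lambda i: dist[i])
        hits += classes[best] == g
    return hits / len(gold)

def kl(p, q):
    """Kullback-Leibler divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence, a symmetric and bounded distance between distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_js(crowd_dists, predicted_dists):
    """Soft metric: average divergence from the annotator distribution."""
    pairs = list(zip(crowd_dists, predicted_dists))
    return sum(js_divergence(c, p) for c, p in pairs) / len(pairs)

# Hypothetical two-item dataset with gold labels and annotator distributions.
classes = ["yes", "no"]
gold = ["yes", "no"]
crowd = [[0.6, 0.4], [0.1, 0.9]]

model_a = [[0.99, 0.01], [0.01, 0.99]]  # confident, ignores disagreement
model_b = [[0.6, 0.4], [0.1, 0.9]]      # mirrors the annotators

acc_a, acc_b = accuracy(gold, model_a, classes), accuracy(gold, model_b, classes)
js_a, js_b = mean_js(crowd, model_a), mean_js(crowd, model_b)
```

In this toy setting the hard metric cannot separate the two models, since both predict the gold label for every item, while the soft metric ranks the model that mirrors the annotators ahead of the overconfident one. This is one way the chosen form of evaluation can change the relative ranking of training methods.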