
    A hierarchical taxonomy for classifying hardness of inference tasks

    Exhibiting inferential capabilities is one of the major goals of many modern Natural Language Processing systems. However, while attempts have been made to define what textual inferences are, few seek to classify inference phenomena by difficulty. In this paper we propose a hierarchical taxonomy of inferences, relative to their hardness, designed with corpus annotation as well as system design and evaluation in mind. A fine-grained assessment of the difficulty of a task allows us to design more appropriate systems and to evaluate them only on what they are designed to handle. Each of the seven classes is described and illustrated with examples from different tasks such as question answering, textual entailment and coreference resolution. We then test the classes of our hierarchy on the specific task of question answering. Our annotation of the test data from the QA4MRE 2013 evaluation campaign shows that it is possible to quantify the contrasts in types of difficulty across datasets for the same task.

    ARNLI: ARABIC NATURAL LANGUAGE INFERENCE ENTAILMENT AND CONTRADICTION DETECTION

    Natural Language Inference (NLI) is an active research topic in natural language processing, and contradiction detection between sentences is a special case of NLI. It is considered a difficult NLP task, and one with significant impact when added as a component to applications such as question answering systems and text summarization. Arabic is one of the most challenging low-resource languages for contradiction detection because of its rich lexical and semantic ambiguity. We have created a dataset of more than 12k sentences, named ArNLI, which will be publicly available. Moreover, we have applied a new model inspired by solutions proposed by Stanford for contradiction detection in English. Our approach detects contradictions between pairs of Arabic sentences using a contradiction vector combined with a language model vector as input to a machine learning model. We analyzed the results of several traditional machine learning classifiers and compared them on our dataset (ArNLI) and on automatic translations of the English PHEME and SICK datasets. The best results were achieved with a Random Forest classifier, with accuracies of 99%, 60% and 75% on PHEME, SICK and ArNLI respectively.
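The pipeline described above can be sketched as follows. This is an illustrative toy, not the ArNLI implementation: the mismatch features in `contradiction_features`, the bag-of-words stand-in for the language model vector, and the tiny English example pairs are all assumptions made for demonstration.

```python
# Sketch of an ArNLI-style pipeline: a hand-crafted "contradiction vector"
# is concatenated with a language model (here: bag-of-words) vector and fed
# to a Random Forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

def contradiction_features(premise, hypothesis):
    """Toy mismatch features: a negation-difference flag and word overlap."""
    p, h = set(premise.split()), set(hypothesis.split())
    negation = float(any(w in {"not", "no", "never"} for w in p ^ h))
    overlap = len(p & h) / max(len(p | h), 1)
    return [negation, overlap]

pairs = [
    ("the cat is asleep", "the cat is not asleep"),
    ("he bought a car", "he purchased a car"),
    ("she plays tennis", "she never plays tennis"),
    ("birds can fly", "birds are able to fly"),
]
labels = [1, 0, 1, 0]  # 1 = contradiction, 0 = no contradiction

# Language model vector: bag-of-words over the concatenated pair.
vec = CountVectorizer()
lm = vec.fit_transform([p + " " + h for p, h in pairs]).toarray()
cv = np.array([contradiction_features(p, h) for p, h in pairs])
X = np.hstack([cv, lm])  # contradiction vector + language model vector

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print(clf.predict(X).tolist())
```

In the paper, the same feature-combination idea is evaluated across several traditional classifiers; Random Forest is simply the one that performed best.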

    Selecting answers with structured lexical expansion and discourse relations: LIMSI's participation at QA4MRE 2013

    In this paper, we present LIMSI's participation in QA4MRE 2013. We tested two kinds of methods. The first focuses on complex questions, such as causal questions, and exploits discourse relations. Relation recognition shows promising results; however, it must be improved before it can affect answer selection. The second method is based on semantic variations: we explored the English Wiktionary to find reformulations of words in their definitions, and used these reformulations to index the documents and select passages in the Entrance Exams task.

    IDRAAQ: New Arabic Question Answering system based on Query Expansion and Passage Retrieval

    Arabic is one of the languages that has received less attention from researchers in the field of Question Answering. This paper presents the core modules of a new Arabic Question Answering system called IDRAAQ. These modules aim at enhancing the quality of retrieved passages with respect to a given question. Experiments were conducted in the framework of the main task of QA4MRE@CLEF 2012, which this year includes the Arabic language. Two runs were submitted; both use only the reading-test documents to answer questions. The difference between the two runs lies in the answer validation process, which is more relaxed in the second run. The Passage Retrieval (PR) module of our system applies multiple levels of processing in order to improve the quality of returned passages and thereby the performance of the whole system. The PR module of IDRAAQ is based on keyword-based and structure-based levels that respectively consist in: (i) a Query Expansion (QE) process relying on Arabic WordNet semantic relations; (ii) a Distance Density N-gram Model based passage retrieval system. The latter level takes the passages retrieved from the QE queries and re-ranks them according to a structure-based similarity score. Named Entities are recognized by means of a mapping between the YAGO ontology and Arabic WordNet. Our experiments show that, in terms of accuracy and the c@1 measure, IDRAAQ achieved encouraging performance, in particular on factoid questions. The same experiments allowed us to identify the system's shortcomings, especially in processing non-factoid questions and at the Answer Validation stage. The IDRAAQ system, which is still under construction, will integrate Conceptual Graph-based passage re-ranking, introducing a semantic level into its PR module.
    Abouenour, L.; Bouzoubaa, K.; Rosso, P. (2012). IDRAAQ: New Arabic Question Answering system based on Query Expansion and Passage Retrieval. CELCT. http://hdl.handle.net/10251/46316
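The two-level retrieval scheme above can be sketched in miniature. This is a simplified illustration, not the IDRAAQ code: the synonym table stands in for Arabic WordNet relations, and `ngram_density` is a crude stand-in for the Distance Density N-gram Model's score; the English example passages are invented for demonstration.

```python
# Sketch of keyword-level query expansion followed by n-gram re-ranking
# of candidate passages, in the spirit of IDRAAQ's PR module.
def expand_query(terms, synonyms):
    """Query expansion: add semantically related terms (IDRAAQ uses
    Arabic WordNet semantic relations; here, a toy synonym table)."""
    expanded = set(terms)
    for t in terms:
        expanded.update(synonyms.get(t, []))
    return expanded

def ngram_density(passage, query_terms, n=2):
    """Crude stand-in for a distance-density score: the fraction of the
    passage's n-grams that contain at least one query term."""
    words = passage.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    hits = sum(1 for g in grams if any(t in g for t in query_terms))
    return hits / max(len(grams), 1)

synonyms = {"purchase": ["buy", "acquire"]}
query = expand_query({"purchase", "car"}, synonyms)
passages = [
    "he decided to buy a new car last week",
    "the weather was sunny and warm today",
]
# Re-rank retrieved passages by the structure-based score.
ranked = sorted(passages, key=lambda p: ngram_density(p, query), reverse=True)
print(ranked[0])
```

Without the expansion step, "purchase" would never match "buy", which is the gap the WordNet-based QE process is meant to close.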

    Benchmarking Machine Reading Comprehension: A Psychological Perspective

    Machine reading comprehension (MRC) has received considerable attention as a benchmark for natural language understanding. However, the conventional task design of MRC lacks explainability beyond model interpretation, i.e., reading comprehension by a model cannot be explained in human terms. To this end, this position paper provides a theoretical basis for the design of MRC datasets based on psychology as well as psychometrics, and summarizes it in terms of the prerequisites for benchmarking MRC. We conclude that future datasets should (i) evaluate the capability of the model for constructing a coherent and grounded representation to understand context-dependent situations and (ii) ensure substantive validity through shortcut-proof questions and explanation as part of the task design.
    Comment: 21 pages, EACL 202

    Towards Mitigating Hallucination in Large Language Models via Self-Reflection

    Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks, including question answering (QA). However, practical deployment still faces challenges, notably the issue of "hallucination", where models generate plausible-sounding but unfaithful or nonsensical information. This issue becomes particularly critical in the medical domain due to the uncommon professional concepts and the potential social risks involved. This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets. Our investigation centers on identifying and understanding common problematic answers, with a specific emphasis on hallucination. To tackle this challenge, we present an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation. Through this feedback process, our approach steadily enhances the factuality, consistency, and entailment of the generated answers. Consequently, we harness the interactivity and multitasking ability of LLMs to produce progressively more precise and accurate answers. Experimental results on both automatic and human evaluation demonstrate the superiority of our approach in hallucination reduction compared to baselines.
    Comment: Accepted by the findings of EMNLP 202
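The feedback loop described above can be sketched as a generate-check-regenerate cycle. This is a hedged outline, not the paper's actual method or API: `generate`, `acquire_knowledge`, and `is_consistent` are hypothetical callables (in practice, LLM prompts), and the toy stubs below exist only to make the control flow runnable.

```python
# Sketch of an interactive self-reflection loop for QA: acquire background
# knowledge, generate an answer, and regenerate with feedback until the
# answer is judged consistent with the knowledge (or rounds run out).
def self_reflective_qa(question, generate, acquire_knowledge, is_consistent,
                       max_rounds=3):
    knowledge = acquire_knowledge(question)
    answer = generate(question, knowledge)
    for _ in range(max_rounds):
        if is_consistent(answer, knowledge):
            break
        # Reflection step: feed the rejected answer back and regenerate.
        answer = generate(question, knowledge, previous=answer)
    return answer

# Deterministic toy stubs standing in for LLM calls.
kb = {"capital of france": "Paris"}
def acquire(q):
    return kb.get(q.lower(), "")
def gen(q, k, previous=None):
    # First attempt hallucinates; the reflection pass grounds itself in k.
    return "Lyon" if previous is None else (k or "unknown")
def consistent(a, k):
    return a == k

print(self_reflective_qa("Capital of France", gen, acquire, consistent))
```

The loop bounds the number of reflection rounds, mirroring the paper's observation that answers improve progressively rather than in a single pass.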

    Precise information retrieval over structured and unstructured information sources: challenges, approaches and hybridization

    This article offers a synthesis of, on the one hand, the approaches developed for question answering (QA) over text, with particular emphasis on models exploiting structured representations of texts, and, on the other hand, recent approaches to QA over knowledge bases. Our goal is to show the shared problems and the possible convergence of these two kinds of answer retrieval, building on the recognition of the relations present in textual utterances and in knowledge bases. We present the few existing works that take this kind of approach in order to put into perspective the open questions on the way to truly hybrid systems grounded in semantic representations.

    REVISITING RECOGNIZING TEXTUAL ENTAILMENT FOR EVALUATING NATURAL LANGUAGE PROCESSING SYSTEMS

    Recognizing Textual Entailment (RTE) began as a unified framework to evaluate the reasoning capabilities of Natural Language Processing (NLP) models. In recent years, RTE has evolved in the NLP community into a task that researchers focus on developing models for. This thesis revisits the tradition of RTE as an evaluation framework for NLP models, especially in the era of deep learning. Chapter 2 provides an overview of different approaches to evaluating NLP systems, discusses prior RTE datasets, and argues why many of them do not serve as satisfactory tests of the reasoning capabilities of NLP systems. Chapter 3 presents a new large-scale diverse collection of RTE datasets (DNC) that tests how well NLP systems capture a range of semantic phenomena that are integral to understanding human language. Chapter 4 demonstrates how the DNC can be used to evaluate the reasoning capabilities of NLP models. Chapter 5 discusses the limits of RTE as an evaluation framework by illuminating how existing datasets contain biases that may enable crude modeling approaches to perform surprisingly well. The remainder of the thesis focuses on issues raised in Chapter 5. Chapter 6 addresses issues in prior RTE datasets focused on paraphrasing and presents a high-quality test set that can be used to analyze how robust RTE systems are to paraphrases. Chapter 7 demonstrates how modeling approaches that target biases, e.g. adversarial learning, can enable RTE models to overcome the biases discussed in Chapter 5. Chapter 8 applies these methods to the task of discovering emergency needs during disaster events.