418 research outputs found

    PersoNER: Persian named-entity recognition

    Full text link
    © 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network

    Information fusion for automated question answering

    Get PDF
    Until recently, research efforts in automated Question Answering (QA) have mainly focused on getting a good understanding of questions to retrieve correct answers. This includes deep parsing, lookups in ontologies, question typing and machine learning of answer patterns appropriate to question forms. In contrast, I have focused on the analysis of the relationships between answer candidates as provided in open domain QA on multiple documents. I argue that such candidates have intrinsic properties, partly regardless of the question, and those properties can be exploited to provide better quality and more user-oriented answers in QA.Information fusion refers to the technique of merging pieces of information from different sources. In QA over free text, it is motivated by the frequency with which different answer candidates are found in different locations, leading to a multiplicity of answers. The reason for such multiplicity is, in part, the massive amount of data used for answering, and also its unstructured and heterogeneous content: Besides am¬ biguities in user questions leading to heterogeneity in extractions, systems have to deal with redundancy, granularity and possible contradictory information. Hence the need for answer candidate comparison. While frequency has proved to be a significant char¬ acteristic of a correct answer, I evaluate the value of other relationships characterizing answer variability and redundancy.Partially inspired by recent developments in multi-document summarization, I re¬ define the concept of "answer" within an engineering approach to QA based on the Model-View-Controller (MVC) pattern of user interface design. An "answer model" is a directed graph in which nodes correspond to entities projected from extractions and edges convey relationships between such nodes. The graph represents the fusion of information contained in the set of extractions. Different views of the answer model can be produced, capturing the fact that the same answer can be expressed and pre¬ sented in various ways: picture, video, sound, written or spoken language, or a formal data structure. Within this framework, an answer is a structured object contained in the model and retrieved by a strategy to build a particular view depending on the end user (or taskj's requirements.I describe shallow techniques to compare entities and enrich the model by discovering four broad categories of relationships between entities in the model: equivalence, inclusion, aggregation and alternative. Quantitatively, answer candidate modeling im¬ proves answer extraction accuracy. It also proves to be more robust to incorrect answer candidates than traditional techniques. Qualitatively, models provide meta-information encoded by relationships that allow shallow reasoning to help organize and generate the final output

    Answering Complex Questions by Joining Multi-Document Evidence with Quasi Knowledge Graphs

    No full text
    Direct answering of questions that involve multiple entities and relations is a challenge for text-based QA. This problem is most pronounced when answers can be found only by joining evidence from multiple documents. Curated knowledge graphs (KGs) may yield good answers, but are limited by their inherent incompleteness and potential staleness. This paper presents QUEST, a method that can answer complex questions directly from textual sources on-the-fly, by computing similarity joins over partial results from different documents. Our method is completely unsupervised, avoiding training-data bottlenecks and being able to cope with rapidly evolving ad hoc topics and formulation style in user questions. QUEST builds a noisy quasi KG with node and edge weights, consisting of dynamically retrieved entity names and relational phrases. It augments this graph with types and semantic alignments, and computes the best answers by an algorithm for Group Steiner Trees. We evaluate QUEST on benchmarks of complex questions, and show that it substantially outperforms state-of-the-art baselines

    Towards More Robust Natural Language Understanding

    Get PDF
    Natural Language Understanding (NLU) is a branch of Natural Language Processing (NLP) that uses intelligent computer software to understand texts that encode human knowledge. Recent years have witnessed notable progress across various NLU tasks with deep learning techniques, especially with pretrained language models. Besides proposing more advanced model architectures, constructing more reliable and trustworthy datasets also plays a huge role in improving NLU systems, without which it would be impossible to train a decent NLU model. It's worth noting that the human ability of understanding natural language is flexible and robust. On the contrary, most of existing NLU systems fail to achieve desirable performance on out-of-domain data or struggle on handling challenging items (e.g., inherently ambiguous items, adversarial items) in the real world. Therefore, in order to have NLU models understand human language more effectively, it is expected to prioritize the study on robust natural language understanding. In this thesis, we deem that NLU systems are consisting of two components: NLU models and NLU datasets. As such, we argue that, to achieve robust NLU, the model architecture/training and the dataset are equally important. Specifically, we will focus on three NLU tasks to illustrate the robustness problem in different NLU tasks and our contributions (i.e., novel models and new datasets) to help achieve more robust natural language understanding. The major technical contributions of this thesis are: (1) We study how to utilize diversity boosters (e.g., beam search & QPP) to help neural question generator synthesize diverse QA pairs, upon which a Question Answering (QA) system is trained to improve the generalization on the unseen target domain. It's worth mentioning that our proposed QPP (question phrase prediction) module, which predicts a set of valid question phrases given an answer evidence, plays an important role in improving the cross-domain generalizability for QA systems. Besides, a target-domain test set is constructed and approved by the community to help evaluate the model robustness under the cross-domain generalization setting. (2) We investigate inherently ambiguous items in Natural Language Inference, for which annotators don't agree on the label. Ambiguous items are overlooked in the literature but often occurring in the real world. We build an ensemble model, AAs (Artificial Annotators), that simulates underlying annotation distribution to effectively identify such inherently ambiguous items. Our AAs are better at handling inherently ambiguous items since the model design captures the essence of the problem better than vanilla model architectures. (3) We follow a standard practice to build a robust dataset for FAQ retrieval task, COUGH. In our dataset analysis, we show how COUGH better reflects the challenge of FAQ retrieval in the real situation than its counterparts. The imposed challenge will push forward the boundary of research on FAQ retrieval in real scenarios. Moving forward, the ultimate goal for robust natural language understanding is to build NLU models which can behave humanly. That is, it's expected that robust NLU systems are capable to transfer the knowledge from training corpus to unseen documents more reliably and survive when encountering challenging items even if the system doesn't know a priori of users' inputs.No embargoAcademic Major: Computer Science and EngineeringAcademic Major: Industrial and Systems Engineerin

    Semantic Representation and Inference for NLP

    Full text link
    Semantic representation and inference is essential for Natural Language Processing (NLP). The state of the art for semantic representation and inference is deep learning, and particularly Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and transformer Self-Attention models. This thesis investigates the use of deep learning for novel semantic representation and inference, and makes contributions in the following three areas: creating training data, improving semantic representations and extending inference learning. In terms of creating training data, we contribute the largest publicly available dataset of real-life factual claims for the purpose of automatic claim verification (MultiFC), and we present a novel inference model composed of multi-scale CNNs with different kernel sizes that learn from external sources to infer fact checking labels. In terms of improving semantic representations, we contribute a novel model that captures non-compositional semantic indicators. By definition, the meaning of a non-compositional phrase cannot be inferred from the individual meanings of its composing words (e.g., hot dog). Motivated by this, we operationalize the compositionality of a phrase contextually by enriching the phrase representation with external word embeddings and knowledge graphs. Finally, in terms of inference learning, we propose a series of novel deep learning architectures that improve inference by using syntactic dependencies, by ensembling role guided attention heads, incorporating gating layers, and concatenating multiple heads in novel and effective ways. This thesis consists of seven publications (five published and two under review).Comment: PhD thesis, the University of Copenhage
    corecore