Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities
Information Extraction (IE) is the task of automatically extracting
structured information from unstructured/semi-structured machine-readable
documents. Among various IE tasks, extracting actionable intelligence from
ever-increasing amounts of data depends critically upon Cross-Document
Coreference Resolution (CDCR) - the task of identifying entity mentions across
multiple documents that refer to the same underlying entity. Recently, document
datasets on the order of tera- to peta-bytes have raised many challenges for
performing effective CDCR such as scaling to large numbers of mentions and
limited representational power. The problem of analysing such datasets is
called "big data". The aim of this paper is to provide readers with an
understanding of the central concepts, subtasks, and the current
state-of-the-art in the CDCR process. We provide an assessment of existing
tools/techniques for CDCR subtasks and highlight big data challenges in each of
them to help readers identify important and outstanding issues for further
investigation. Finally, we provide concluding remarks and discuss possible
directions for future work.
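The clustering step at the heart of CDCR can be illustrated with a deliberately naive sketch: mentions from several documents are grouped whenever their normalized surface forms match. This is only a toy baseline under assumed inputs (all names are hypothetical); real CDCR systems use contextual and semantic features.

```python
from collections import defaultdict

def normalize(mention):
    """Crude normalization: lowercase and strip common honorifics."""
    m = mention.lower().strip(".,")
    for title in ("mr. ", "mrs. ", "dr. ", "president "):
        if m.startswith(title):
            m = m[len(title):]
    return m

def cluster_mentions(docs):
    """Group (doc_id, mention) pairs that share a normalized form.

    Exact string match after normalization is only a naive baseline;
    it stands in for the richer similarity models surveyed above.
    """
    clusters = defaultdict(list)
    for doc_id, mentions in docs.items():
        for mention in mentions:
            clusters[normalize(mention)].append((doc_id, mention))
    return dict(clusters)

# Hypothetical two-document corpus.
docs = {
    "d1": ["President Obama", "Obama"],
    "d2": ["obama", "Michelle Obama"],
}
print(cluster_mentions(docs))
```

Even this trivial baseline makes the scaling challenge visible: the number of candidate mention pairs grows quadratically with corpus size, which is exactly where big-data concerns enter.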
The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers
Solving mathematical word problems (MWPs) automatically is challenging,
primarily due to the semantic gap between human-readable words and
machine-understandable logics. Despite the long history dated back to the1960s,
MWPs have regained intensive attention in the past few years with the
advancement of Artificial Intelligence (AI). Solving MWPs successfully is
considered a milestone towards general AI. Many systems have claimed
promising results in self-crafted and small-scale datasets. However, when
applied to large and diverse datasets, none of the proposed methods in the
literature achieves high precision, revealing that current MWP solvers still
have much room for improvement. This motivated us to present a comprehensive
survey to deliver a clear and complete picture of automatic math problem
solvers. In this survey, we emphasize algebraic word problems, summarize the
extracted features and techniques proposed to bridge the semantic gap, and
compare their performance on publicly accessible datasets. We also cover
automatic solvers for other types of math problems such as geometric problems
that require the understanding of diagrams. Finally, we identify several
emerging research directions for readers with an interest in MWPs.
Comment: 18 pages, 5 figures
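The semantic gap the survey describes can be made concrete with a toy rule-based solver: extract the numbers from a problem, pick an operator from verb cues, and evaluate. This is a hypothetical sketch of the words-to-equation mapping only, nowhere near the coverage of the surveyed systems.

```python
import re

def solve_simple_mwp(problem):
    """Map a narrow class of two-number arithmetic word problems
    to an equation: numbers come from the text, the operator from
    lexical cues. Real solvers need far richer semantics.
    """
    nums = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", problem)]
    if len(nums) != 2:
        raise ValueError("only two-number problems are handled")
    a, b = nums
    text = problem.lower()
    # Subtraction cues checked first; cue lists are illustrative.
    if any(cue in text for cue in ("gives away", "loses", "eats", "left")):
        return a - b
    if any(cue in text for cue in ("buys", "gets", "finds", "altogether")):
        return a + b
    raise ValueError("no operator cue found")

print(solve_simple_mwp("Dan has 5 apples and buys 3 more. How many now?"))  # 8.0
```

The brittleness of such cue lists on diverse datasets is precisely the low precision the survey reports for existing solvers.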
A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
We develop a Bayesian framework for tackling the supervised clustering
problem, the generic problem encountered in tasks such as reference matching,
coreference resolution, identity uncertainty and record linkage. Our clustering
model is based on the Dirichlet process prior, which enables us to define
distributions over the countably infinite sets that naturally arise in this
problem. We add supervision to our model by positing the existence of a set of
unobserved random variables (we call these "reference types") that are generic
across all clusters. Inference in our framework, which requires integrating
over infinitely many parameters, is solved using Markov chain Monte Carlo
techniques. We present algorithms for both conjugate and non-conjugate priors.
We present a simple--but general--parameterization of our model based on a
Gaussian assumption. We evaluate this model on one artificial task and three
real-world tasks, comparing it against both unsupervised and state-of-the-art
supervised algorithms. Our results show that our model is able to outperform
other models across a variety of tasks and performance metrics.
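The Dirichlet process prior over partitions can be sampled with the Chinese Restaurant Process: each item joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to a concentration parameter alpha. A minimal sketch (the alpha value and item count are arbitrary, and this shows only the prior, not the paper's supervised inference):

```python
import random

def crp_partition(n_items, alpha, seed=0):
    """Sample a partition of n_items from a Dirichlet process prior
    via the Chinese Restaurant Process."""
    rng = random.Random(seed)
    clusters = []      # current cluster sizes
    assignments = []   # cluster index for each item
    for i in range(n_items):
        weights = clusters + [alpha]   # existing clusters, then "new"
        r = rng.uniform(0, i + alpha)  # total mass = sum(sizes) + alpha
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(clusters):
            clusters.append(1)  # open a new cluster
        else:
            clusters[k] += 1
        assignments.append(k)
    return assignments

print(crp_partition(10, alpha=1.0))
```

Because a new cluster is always available with mass alpha, the process defines a distribution over partitions into a countably infinite set of clusters, which is what lets the model avoid fixing the number of clusters in advance.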
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
One long-term goal of machine learning research is to produce methods that
are applicable to reasoning and natural language, in particular building an
intelligent dialogue agent. To measure progress towards that goal, we argue for
the usefulness of a set of proxy tasks that evaluate reading comprehension via
question answering. Our tasks measure understanding in several ways: whether a
system is able to answer questions via chaining facts, simple induction,
deduction and many more. The tasks are designed to be prerequisites for any
system that aims to be capable of conversing with a human. We believe many
existing learning systems currently cannot solve them, and hence our aim is to
classify these tasks into skill sets, so that researchers can identify (and
then rectify) the failings of their systems. We also extend and improve the
recently introduced Memory Networks model, and show it is able to solve some,
but not all, of the tasks.
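The fact-chaining skill these toy tasks probe can be shown with a hand-written sketch: to answer "Where is the X?", chain who last picked up the object with where that person last moved. The story, verb lists, and question format below are hypothetical stand-ins for the actual task data.

```python
def answer_where(story, question_obj):
    """Answer 'Where is the X?' by chaining two facts:
    who holds the object, and where that person last moved."""
    holder = None
    locations = {}
    for line in story:
        words = line.rstrip(".").split()
        if words[1] in ("went", "journeyed", "moved"):
            locations[words[0]] = words[-1]   # "Mary went to the kitchen."
        elif words[1] in ("picked", "got", "took", "grabbed"):
            if words[-1] == question_obj:
                holder = words[0]             # "Mary picked up the football."
    return locations.get(holder)

story = [
    "Mary went to the hallway.",
    "Mary picked up the football.",
    "Mary went to the kitchen.",
]
print(answer_where(story, "football"))  # kitchen
```

A learning system must acquire this kind of two-step chain from examples rather than hand-written rules, which is what makes even such small tasks a useful diagnostic.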
Detecting and Extracting Events from Text Documents
Events of various kinds are mentioned and discussed in text documents,
whether they are books, news articles, blogs or microblog feeds. The paper
starts by giving an overview of how events are treated in linguistics and
philosophy. We follow this discussion by surveying how events and associated
information are handled computationally. In particular, we look at how
textual documents can be mined to extract events and ancillary information.
These days, this is mostly done through the application of various machine learning
techniques. We also discuss applications of event detection and extraction
systems, particularly in summarization, in the medical domain and in the
context of Twitter posts. We end the paper with a discussion of challenges and
future directions.
Comment: This is work in progress. Please email [email protected] with any
comments for improvement.
A Deterministic Algorithm for Bridging Anaphora Resolution
Previous work on bridging anaphora resolution (Poesio et al., 2004; Hou et
al., 2013b) uses syntactic preposition patterns to calculate word relatedness.
However, such patterns only consider NPs' head nouns and hence do not fully
capture the semantics of NPs. Recently, Hou (2018) created word embeddings
(embeddings_PP) to capture associative similarity (i.e., relatedness) between
nouns by exploring the syntactic structure of noun phrases. But embeddings_PP
only contains word representations for nouns. In this paper, we create new word
vectors by combining embeddings_PP with GloVe. The resulting word embeddings
(embeddings_bridging) are a more general lexical knowledge resource for
bridging and allow us to easily represent the meaning of an NP beyond its head.
We therefore develop a deterministic approach for bridging anaphora resolution,
which represents the semantics of an NP based on its head noun and
modifiers. We show that this simple approach achieves results competitive
with the best system of Hou et al. (2013b), which explores Markov
Logic Networks to model the problem. Additionally, we further improve the
results for bridging anaphora resolution reported in Hou (2018) by combining
our simple deterministic approach with the best system (MLN II) of Hou et al. (2013b).
Comment: 11 pages
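The combination step can be sketched as simple vector concatenation: each word's embeddings_PP vector is joined with its GloVe vector, backing off to zeros when one resource lacks the word. The toy 2-d vectors below are hypothetical stand-ins; the paper's actual resources and combination details may differ.

```python
import math

def concat_embedding(word, emb_pp, emb_glove):
    """Concatenate an embeddings_PP vector with a GloVe vector,
    using a zero vector when one resource lacks the word."""
    dim_pp = len(next(iter(emb_pp.values())))
    dim_gl = len(next(iter(emb_glove.values())))
    v1 = emb_pp.get(word, [0.0] * dim_pp)
    v2 = emb_glove.get(word, [0.0] * dim_gl)
    return v1 + v2

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical 2-d toy vectors standing in for the real resources.
emb_pp = {"door": [0.9, 0.1], "room": [0.8, 0.2]}
emb_glove = {"door": [0.3, 0.7], "room": [0.4, 0.6], "wooden": [0.1, 0.9]}

u = concat_embedding("door", emb_pp, emb_glove)
v = concat_embedding("room", emb_pp, emb_glove)
print(cosine(u, v))
```

The payoff of concatenation is coverage: words missing from embeddings_PP (like modifiers) still get a GloVe component, so an NP can be represented beyond its head noun.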
This before That: Causal Precedence in the Biomedical Domain
Causal precedence between biochemical interactions is crucial in the
biomedical domain, because it transforms collections of individual
interactions, e.g., bindings and phosphorylations, into the causal mechanisms
needed to inform meaningful search and inference. Here, we analyze causal
precedence in the biomedical domain as distinct from open-domain, temporal
precedence. First, we describe a novel, hand-annotated text corpus of causal
precedence in the biomedical domain. Second, we use this corpus to investigate
a battery of models of precedence, covering rule-based, feature-based, and
latent representation models. The highest-performing individual model achieved
a micro F1 of 43 points, approaching the best performers on the simpler
temporal-only precedence tasks. Feature-based and latent representation models
each outperform the rule-based models, but their performance is complementary
to one another. We apply a sieve-based architecture to capitalize on this lack
of overlap, achieving a micro F1 score of 46 points.
Comment: To appear in the proceedings of the 2016 Workshop on Biomedical
Natural Language Processing (BioNLP 2016).
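The sieve-based architecture can be sketched generically: models run in decreasing-precision order, and each sieve only labels the instances all earlier sieves left undecided. The sieves and instances below are hypothetical toys, not the paper's actual models.

```python
def sieve_pipeline(instances, sieves):
    """Apply models in decreasing-precision order; each sieve only
    labels instances that earlier sieves left undecided (None)."""
    labels = {i: None for i in instances}
    for sieve in sieves:
        for inst in instances:
            if labels[inst] is None:
                labels[inst] = sieve(inst)
    return labels

# Hypothetical sieves: a high-precision rule, then a broader fallback.
rule_based = lambda s: "precedes" if "before" in s else None
fallback   = lambda s: "precedes" if "then" in s else "none"

instances = ["A before B", "A then B", "A and B"]
print(sieve_pipeline(instances, [rule_based, fallback]))
```

Because a high-precision sieve never gets overruled by a lower-precision one, this design directly exploits the complementary errors of feature-based and latent representation models that the paper reports.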
Implicit Argument Prediction as Reading Comprehension
Implicit arguments, which cannot be detected solely through syntactic cues,
make it harder to extract predicate-argument tuples. We present a new model for
implicit argument prediction that draws on reading comprehension, casting the
predicate-argument tuple with the missing argument as a query. We also draw on
pointer networks and multi-hop computation. Our model shows good performance on
an argument cloze task as well as on a nominal implicit argument prediction
task.
Comment: Accepted at AAAI 201
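The reading-comprehension framing can be sketched at its simplest: a predicate-argument tuple with one missing argument becomes a cloze-style query. The string format below is hypothetical; the paper's model operates on learned representations, not literal strings.

```python
def tuple_to_query(pred, args):
    """Turn a predicate-argument tuple with one missing argument
    (None) into a cloze-style query string."""
    slots = [a if a is not None else "___" for a in args]
    return f"{pred}({', '.join(slots)})"

# "sell" with an unknown second argument (e.g., the buyer).
print(tuple_to_query("sell", ["company", None, "shares"]))
```

Framing the missing argument as a query slot is what lets reading-comprehension machinery, such as pointer networks over the document, propose the filler.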
Generating and Exploiting Large-scale Pseudo Training Data for Zero Pronoun Resolution
Most existing approaches for zero pronoun resolution rely heavily on
annotated data, which is often released by shared task organizers. The lack
of annotated data is therefore a major obstacle to progress on the zero
pronoun resolution task, and labeling more data by hand for better
performance is expensive. To alleviate this problem, in this paper,
we propose a simple but novel approach to automatically generate large-scale
pseudo training data for zero pronoun resolution. Furthermore, we successfully
transfer a cloze-style reading comprehension neural network model to the zero
pronoun resolution task and propose a two-step training mechanism to overcome
the gap between the pseudo training data and the real data. Experimental
results show that the proposed approach significantly outperforms
state-of-the-art systems, with an absolute improvement of 3.1% in F-score on
OntoNotes 5.0 data.
Comment: 8+2 pages, published as a conference paper at ACL 2017 (long paper)
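The pseudo-data idea can be sketched in cloze form: blank out an occurrence of a candidate noun in raw text, making the removed word the answer the model must recover. This is a toy illustration of the generation idea under assumed inputs, not the paper's exact procedure.

```python
def make_pseudo_examples(sentences, candidates):
    """Generate cloze-style pseudo training data: blank out each
    occurrence of a candidate noun, making the removed word the
    answer to be recovered from context."""
    examples = []
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            if w in candidates:
                query = " ".join(words[:i] + ["<blank>"] + words[i + 1:])
                examples.append((query, w))
    return examples

# Hypothetical raw text and candidate answer nouns.
sents = ["the cat chased the mouse", "the mouse escaped"]
print(make_pseudo_examples(sents, {"cat", "mouse"}))
```

Because the blanked slot resembles a dropped (zero) pronoun whose antecedent must be found in context, such examples can pretrain a cloze-style reader before the two-step adaptation to real annotated data.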
A Tidy Data Model for Natural Language Processing using cleanNLP
The package cleanNLP provides a set of fast tools for converting a textual
corpus into a set of normalized tables. The underlying natural language
processing pipeline utilizes Stanford's CoreNLP library, exposing a number of
annotation tasks for text written in English, French, German, and Spanish.
Annotators include tokenization, part of speech tagging, named entity
recognition, entity linking, sentiment analysis, dependency parsing,
coreference resolution, and information extraction.
Comment: 20 pages; 4 figures
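The "normalized tables" shape cleanNLP produces can be illustrated abstractly: one row per token, keyed by document, sentence, and token ids. Note that cleanNLP is an R package; this Python sketch only mimics the tidy-data idea with a naive tokenizer, and the real package emits many more annotation columns.

```python
def to_token_table(corpus):
    """Flatten a corpus into one row per token with document,
    sentence, and token ids: the normalized-table ("tidy") shape."""
    rows = []
    for doc_id, text in corpus.items():
        for sid, sent in enumerate(text.split(". ")):
            for tid, tok in enumerate(sent.rstrip(".").split()):
                rows.append({"doc": doc_id, "sent": sid,
                             "tid": tid, "token": tok})
    return rows

# Hypothetical one-document corpus.
corpus = {"d1": "It rained. We stayed inside."}
for row in to_token_table(corpus):
    print(row)
```

Keying every annotation (POS, entities, dependencies) on the same (doc, sent, tid) triple is what lets the separate tables be joined back together for analysis.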