Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities
Information Extraction (IE) is the task of automatically extracting
structured information from unstructured/semi-structured machine-readable
documents. Among various IE tasks, extracting actionable intelligence from
ever-increasing amounts of data depends critically upon Cross-Document
Coreference Resolution (CDCR) - the task of identifying entity mentions across
multiple documents that refer to the same underlying entity. Recently, document
datasets on the order of tera- to peta-bytes have raised many challenges for
performing effective CDCR such as scaling to large numbers of mentions and
limited representational power. The problem of analysing such datasets is
called "big data". The aim of this paper is to provide readers with an
understanding of the central concepts, subtasks, and the current
state-of-the-art in the CDCR process. We provide an assessment of existing
tools/techniques for CDCR subtasks and highlight big data challenges in each of
them to help readers identify important and outstanding issues for further
investigation. Finally, we provide concluding remarks and discuss possible
directions for future work.
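The clustering step at the heart of CDCR can be illustrated with a deliberately naive sketch: mentions from several documents are grouped whenever their normalized surface forms match. This is only a toy baseline under assumed inputs (all names are hypothetical); real CDCR systems use contextual and semantic features.

```python
from collections import defaultdict

def normalize(mention):
    """Crude normalization: lowercase and strip common honorifics."""
    m = mention.lower().strip(".,")
    for title in ("mr. ", "mrs. ", "dr. ", "president "):
        if m.startswith(title):
            m = m[len(title):]
    return m

def cluster_mentions(docs):
    """Group (doc_id, mention) pairs that share a normalized form.

    Exact string match after normalization is only a naive baseline;
    it stands in for the richer similarity models surveyed above.
    """
    clusters = defaultdict(list)
    for doc_id, mentions in docs.items():
        for mention in mentions:
            clusters[normalize(mention)].append((doc_id, mention))
    return dict(clusters)

# Hypothetical two-document corpus.
docs = {
    "d1": ["President Obama", "Obama"],
    "d2": ["obama", "Michelle Obama"],
}
print(cluster_mentions(docs))
```

Even this trivial baseline makes the scaling challenge visible: the number of candidate mention pairs grows quadratically with corpus size, which is exactly where big-data concerns enter.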
The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers
Solving mathematical word problems (MWPs) automatically is challenging,
primarily due to the semantic gap between human-readable words and
machine-understandable logics. Despite the long history dated back to the1960s,
MWPs have regained intensive attention in the past few years with the
advancement of Artificial Intelligence (AI). Solving MWPs successfully is
considered a milestone towards general AI. Many systems have claimed
promising results in self-crafted and small-scale datasets. However, when
applied to large and diverse datasets, none of the proposed methods in the
literature achieves high precision, revealing that current MWP solvers still
have much room for improvement. This motivated us to present a comprehensive
survey to deliver a clear and complete picture of automatic math problem
solvers. In this survey, we emphasize algebraic word problems, summarize the
extracted features and techniques proposed to bridge the semantic gap, and
compare their performance on publicly accessible datasets. We also cover
automatic solvers for other types of math problems such as geometric problems
that require the understanding of diagrams. Finally, we identify several
emerging research directions for readers with an interest in MWPs.
Comment: 18 pages, 5 figures
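The semantic gap the survey describes can be made concrete with a toy rule-based solver: extract the numbers from a problem, pick an operator from verb cues, and evaluate. This is a hypothetical sketch of the words-to-equation mapping only, nowhere near the coverage of the surveyed systems.

```python
import re

def solve_simple_mwp(problem):
    """Map a narrow class of two-number arithmetic word problems
    to an equation: numbers come from the text, the operator from
    lexical cues. Real solvers need far richer semantics.
    """
    nums = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", problem)]
    if len(nums) != 2:
        raise ValueError("only two-number problems are handled")
    a, b = nums
    text = problem.lower()
    # Subtraction cues checked first; cue lists are illustrative.
    if any(cue in text for cue in ("gives away", "loses", "eats", "left")):
        return a - b
    if any(cue in text for cue in ("buys", "gets", "finds", "altogether")):
        return a + b
    raise ValueError("no operator cue found")

print(solve_simple_mwp("Dan has 5 apples and buys 3 more. How many now?"))  # 8.0
```

The brittleness of such cue lists on diverse datasets is precisely the low precision the survey reports for existing solvers.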
A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
We develop a Bayesian framework for tackling the supervised clustering
problem, the generic problem encountered in tasks such as reference matching,
coreference resolution, identity uncertainty and record linkage. Our clustering
model is based on the Dirichlet process prior, which enables us to define
distributions over the countably infinite sets that naturally arise in this
problem. We add supervision to our model by positing the existence of a set of
unobserved random variables (we call these "reference types") that are generic
across all clusters. Inference in our framework, which requires integrating
over infinitely many parameters, is solved using Markov chain Monte Carlo
techniques. We present algorithms for both conjugate and non-conjugate priors.
We present a simple--but general--parameterization of our model based on a
Gaussian assumption. We evaluate this model on one artificial task and three
real-world tasks, comparing it against both unsupervised and state-of-the-art
supervised algorithms. Our results show that our model is able to outperform
other models across a variety of tasks and performance metrics.
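The Dirichlet process prior over partitions can be sampled with the Chinese Restaurant Process: each item joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to a concentration parameter alpha. A minimal sketch (the alpha value and item count are arbitrary, and this shows only the prior, not the paper's supervised inference):

```python
import random

def crp_partition(n_items, alpha, seed=0):
    """Sample a partition of n_items from a Dirichlet process prior
    via the Chinese Restaurant Process."""
    rng = random.Random(seed)
    clusters = []      # current cluster sizes
    assignments = []   # cluster index for each item
    for i in range(n_items):
        weights = clusters + [alpha]   # existing clusters, then "new"
        r = rng.uniform(0, i + alpha)  # total mass = sum(sizes) + alpha
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(clusters):
            clusters.append(1)  # open a new cluster
        else:
            clusters[k] += 1
        assignments.append(k)
    return assignments

print(crp_partition(10, alpha=1.0))
```

Because a new cluster is always available with mass alpha, the process defines a distribution over partitions into a countably infinite set of clusters, which is what lets the model avoid fixing the number of clusters in advance.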
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
One long-term goal of machine learning research is to produce methods that
are applicable to reasoning and natural language, in particular building an
intelligent dialogue agent. To measure progress towards that goal, we argue for
the usefulness of a set of proxy tasks that evaluate reading comprehension via
question answering. Our tasks measure understanding in several ways: whether a
system is able to answer questions via chaining facts, simple induction,
deduction and many more. The tasks are designed to be prerequisites for any
system that aims to be capable of conversing with a human. We believe many
existing learning systems currently cannot solve them, and hence our aim is to
classify these tasks into skill sets, so that researchers can identify (and
then rectify) the failings of their systems. We also extend and improve the
recently introduced Memory Networks model, and show it is able to solve some,
but not all, of the tasks.
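The fact-chaining skill these toy tasks probe can be shown with a hand-written sketch: to answer "Where is the X?", chain who last picked up the object with where that person last moved. The story, verb lists, and question format below are hypothetical stand-ins for the actual task data.

```python
def answer_where(story, question_obj):
    """Answer 'Where is the X?' by chaining two facts:
    who holds the object, and where that person last moved."""
    holder = None
    locations = {}
    for line in story:
        words = line.rstrip(".").split()
        if words[1] in ("went", "journeyed", "moved"):
            locations[words[0]] = words[-1]   # "Mary went to the kitchen."
        elif words[1] in ("picked", "got", "took", "grabbed"):
            if words[-1] == question_obj:
                holder = words[0]             # "Mary picked up the football."
    return locations.get(holder)

story = [
    "Mary went to the hallway.",
    "Mary picked up the football.",
    "Mary went to the kitchen.",
]
print(answer_where(story, "football"))  # kitchen
```

A learning system must acquire this kind of two-step chain from examples rather than hand-written rules, which is what makes even such small tasks a useful diagnostic.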
Detecting and Extracting Events from Text Documents
Events of various kinds are mentioned and discussed in text documents,
whether they are books, news articles, blogs or microblog feeds. The paper
starts by giving an overview of how events are treated in linguistics and
philosophy. We follow this discussion by surveying how events and associated
information are handled computationally. In particular, we look at how
textual documents can be mined to extract events and ancillary information.
These days, this is mostly done through the application of various machine learning
techniques. We also discuss applications of event detection and extraction
systems, particularly in summarization, in the medical domain and in the
context of Twitter posts. We end the paper with a discussion of challenges and
future directions.
Comment: This is work in progress. Please email [email protected] with any
comments for improvement.
A Deterministic Algorithm for Bridging Anaphora Resolution
Previous work on bridging anaphora resolution (Poesio et al., 2004; Hou et
al., 2013b) uses syntactic preposition patterns to calculate word relatedness.
However, such patterns only consider NPs' head nouns and hence do not fully
capture the semantics of NPs. Recently, Hou (2018) created word embeddings
(embeddings_PP) to capture associative similarity (i.e., relatedness) between
nouns by exploring the syntactic structure of noun phrases. But embeddings_PP
only contains word representations for nouns. In this paper, we create new word
vectors by combining embeddings_PP with GloVe. The resulting word embeddings
(embeddings_bridging) are a more general lexical knowledge resource for
bridging and allow us to easily represent the meaning of an NP beyond its head.
We therefore develop a deterministic approach for bridging anaphora resolution,
which represents the semantics of an NP based on its head noun and
modifiers. We show that this simple approach achieves results competitive
with the best system of Hou et al. (2013b), which explores Markov
Logic Networks to model the problem. Additionally, we further improve the
results for bridging anaphora resolution reported in Hou (2018) by combining
our simple deterministic approach with the best system (MLN II) of Hou et al. (2013b).
Comment: 11 pages
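The combination step can be sketched as simple vector concatenation: each word's embeddings_PP vector is joined with its GloVe vector, backing off to zeros when one resource lacks the word. The toy 2-d vectors below are hypothetical stand-ins; the paper's actual resources and combination details may differ.

```python
import math

def concat_embedding(word, emb_pp, emb_glove):
    """Concatenate an embeddings_PP vector with a GloVe vector,
    using a zero vector when one resource lacks the word."""
    dim_pp = len(next(iter(emb_pp.values())))
    dim_gl = len(next(iter(emb_glove.values())))
    v1 = emb_pp.get(word, [0.0] * dim_pp)
    v2 = emb_glove.get(word, [0.0] * dim_gl)
    return v1 + v2

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical 2-d toy vectors standing in for the real resources.
emb_pp = {"door": [0.9, 0.1], "room": [0.8, 0.2]}
emb_glove = {"door": [0.3, 0.7], "room": [0.4, 0.6], "wooden": [0.1, 0.9]}

u = concat_embedding("door", emb_pp, emb_glove)
v = concat_embedding("room", emb_pp, emb_glove)
print(cosine(u, v))
```

The payoff of concatenation is coverage: words missing from embeddings_PP (like modifiers) still get a GloVe component, so an NP can be represented beyond its head noun.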
This before That: Causal Precedence in the Biomedical Domain
Causal precedence between biochemical interactions is crucial in the
biomedical domain, because it transforms collections of individual
interactions, e.g., bindings and phosphorylations, into the causal mechanisms
needed to inform meaningful search and inference. Here, we analyze causal
precedence in the biomedical domain as distinct from open-domain, temporal
precedence. First, we describe a novel, hand-annotated text corpus of causal
precedence in the biomedical domain. Second, we use this corpus to investigate
a battery of models of precedence, covering rule-based, feature-based, and
latent representation models. The highest-performing individual model achieved
a micro F1 of 43 points, approaching the best performers on the simpler
temporal-only precedence tasks. Feature-based and latent representation models
each outperform the rule-based models, but their performance is complementary
to one another. We apply a sieve-based architecture to capitalize on this lack
of overlap, achieving a micro F1 score of 46 points.
Comment: To appear in the proceedings of the 2016 Workshop on Biomedical
Natural Language Processing (BioNLP 2016).
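The sieve-based architecture can be sketched generically: models run in decreasing-precision order, and each sieve only labels the instances all earlier sieves left undecided. The sieves and instances below are hypothetical toys, not the paper's actual models.

```python
def sieve_pipeline(instances, sieves):
    """Apply models in decreasing-precision order; each sieve only
    labels instances that earlier sieves left undecided (None)."""
    labels = {i: None for i in instances}
    for sieve in sieves:
        for inst in instances:
            if labels[inst] is None:
                labels[inst] = sieve(inst)
    return labels

# Hypothetical sieves: a high-precision rule, then a broader fallback.
rule_based = lambda s: "precedes" if "before" in s else None
fallback   = lambda s: "precedes" if "then" in s else "none"

instances = ["A before B", "A then B", "A and B"]
print(sieve_pipeline(instances, [rule_based, fallback]))
```

Because a high-precision sieve never gets overruled by a lower-precision one, this design directly exploits the complementary errors of feature-based and latent representation models that the paper reports.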
Implicit Argument Prediction as Reading Comprehension
Implicit arguments, which cannot be detected solely through syntactic cues,
make it harder to extract predicate-argument tuples. We present a new model for
implicit argument prediction that draws on reading comprehension, casting the
predicate-argument tuple with the missing argument as a query. We also draw on
pointer networks and multi-hop computation. Our model shows good performance on
an argument cloze task as well as on a nominal implicit argument prediction
task.
Comment: Accepted at AAAI 201
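The reading-comprehension framing can be sketched at its simplest: a predicate-argument tuple with one missing argument becomes a cloze-style query. The string format below is hypothetical; the paper's model operates on learned representations, not literal strings.

```python
def tuple_to_query(pred, args):
    """Turn a predicate-argument tuple with one missing argument
    (None) into a cloze-style query string."""
    slots = [a if a is not None else "___" for a in args]
    return f"{pred}({', '.join(slots)})"

# "sell" with an unknown second argument (e.g., the buyer).
print(tuple_to_query("sell", ["company", None, "shares"]))
```

Framing the missing argument as a query slot is what lets reading-comprehension machinery, such as pointer networks over the document, propose the filler.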
Generating and Exploiting Large-scale Pseudo Training Data for Zero Pronoun Resolution
Most existing approaches for zero pronoun resolution rely heavily on
annotated data, which is often released by shared task organizers. The lack
of annotated data is therefore a major obstacle to progress on the zero
pronoun resolution task, and labeling more data by hand for better
performance is expensive. To alleviate this problem, in this paper,
we propose a simple but novel approach to automatically generate large-scale
pseudo training data for zero pronoun resolution. Furthermore, we successfully
transfer a cloze-style reading comprehension neural network model to the zero
pronoun resolution task and propose a two-step training mechanism to overcome
the gap between the pseudo training data and the real data. Experimental
results show that the proposed approach significantly outperforms
state-of-the-art systems, with an absolute improvement of 3.1% in F-score on
OntoNotes 5.0 data.
Comment: 8+2 pages, published as a conference paper at ACL 2017 (long paper)
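The pseudo-data idea can be sketched in cloze form: blank out an occurrence of a candidate noun in raw text, making the removed word the answer the model must recover. This is a toy illustration of the generation idea under assumed inputs, not the paper's exact procedure.

```python
def make_pseudo_examples(sentences, candidates):
    """Generate cloze-style pseudo training data: blank out each
    occurrence of a candidate noun, making the removed word the
    answer to be recovered from context."""
    examples = []
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            if w in candidates:
                query = " ".join(words[:i] + ["<blank>"] + words[i + 1:])
                examples.append((query, w))
    return examples

# Hypothetical raw text and candidate answer nouns.
sents = ["the cat chased the mouse", "the mouse escaped"]
print(make_pseudo_examples(sents, {"cat", "mouse"}))
```

Because the blanked slot resembles a dropped (zero) pronoun whose antecedent must be found in context, such examples can pretrain a cloze-style reader before the two-step adaptation to real annotated data.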
A Tidy Data Model for Natural Language Processing using cleanNLP
The package cleanNLP provides a set of fast tools for converting a textual
corpus into a set of normalized tables. The underlying natural language
processing pipeline utilizes Stanford's CoreNLP library, exposing a number of
annotation tasks for text written in English, French, German, and Spanish.
Annotators include tokenization, part of speech tagging, named entity
recognition, entity linking, sentiment analysis, dependency parsing,
coreference resolution, and information extraction.
Comment: 20 pages; 4 figures
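The "normalized tables" shape cleanNLP produces can be illustrated abstractly: one row per token, keyed by document, sentence, and token ids. Note that cleanNLP is an R package; this Python sketch only mimics the tidy-data idea with a naive tokenizer, and the real package emits many more annotation columns.

```python
def to_token_table(corpus):
    """Flatten a corpus into one row per token with document,
    sentence, and token ids: the normalized-table ("tidy") shape."""
    rows = []
    for doc_id, text in corpus.items():
        for sid, sent in enumerate(text.split(". ")):
            for tid, tok in enumerate(sent.rstrip(".").split()):
                rows.append({"doc": doc_id, "sent": sid,
                             "tid": tid, "token": tok})
    return rows

# Hypothetical one-document corpus.
corpus = {"d1": "It rained. We stayed inside."}
for row in to_token_table(corpus):
    print(row)
```

Keying every annotation (POS, entities, dependencies) on the same (doc, sent, tid) triple is what lets the separate tables be joined back together for analysis.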