Conditional Random Field Autoencoders for Unsupervised Structured Prediction
We introduce a framework for unsupervised learning of structured predictors
with overlapping, global features. Each input's latent representation is
predicted conditional on the observable data using a feature-rich conditional
random field. Then a reconstruction of the input is (re)generated, conditional
on the latent structure, using models for which maximum likelihood estimation
has a closed-form. Our autoencoder formulation enables efficient learning
without making unrealistic independence assumptions or restricting the kinds of
features that can be used. We illustrate insightful connections to traditional
autoencoders, posterior regularization and multi-view learning. We show
competitive results with instantiations of the model for two canonical NLP
tasks: part-of-speech induction and bitext word alignment, and show that
training our model can be substantially more efficient than comparable
feature-rich baselines.
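The encode-then-reconstruct factorization described above can be illustrated with a toy sketch: a latent sequence y is scored against the input x by a feature-rich conditional model, p(y|x) is obtained by normalization, and a simple per-label categorical model regenerates the input from y. Everything here (the vocabulary, labels, feature weights, and reconstruction probabilities) is a hand-built assumption for illustration, and exact inference is done by brute-force enumeration rather than dynamic programming; this is a sketch of the factorization, not the paper's actual model or training procedure.

```python
import itertools
import math
from collections import defaultdict

# Toy vocabulary and latent label set (illustrative, not the paper's setup).
words = ["the", "dog", "barks"]
labels = ["D", "N", "V"]

# Hand-set weights over overlapping features: here word-identity features
# plus label-pair features, mimicking a feature-rich conditional model.
weights = defaultdict(float, {
    ("word=the", "D"): 2.0,
    ("word=dog", "N"): 2.0,
    ("word=barks", "V"): 2.0,
    ("pair", ("D", "N")): 1.0,
    ("pair", ("N", "V")): 1.0,
})

def score(x, y):
    """Unnormalized log-score of latent sequence y given input x."""
    s = 0.0
    for w, l in zip(x, y):
        s += weights[(f"word={w}", l)]
    for a, b in zip(y, y[1:]):
        s += weights[("pair", (a, b))]
    return s

def p_y_given_x(x):
    """Exact p(y|x) by enumerating all label sequences (toy-sized only)."""
    seqs = list(itertools.product(labels, repeat=len(x)))
    exps = [math.exp(score(x, y)) for y in seqs]
    Z = sum(exps)
    return {y: e / Z for y, e in zip(seqs, exps)}

# Reconstruction model: a per-label categorical p(word|label), whose MLE
# is closed-form (normalized counts); fixed by hand for illustration.
recon = {("D", "the"): 0.9, ("N", "dog"): 0.8, ("V", "barks"): 0.8}

def p_x_given_y(x, y):
    p = 1.0
    for w, l in zip(x, y):
        p *= recon.get((l, w), 0.01)  # small floor for unseen pairs
    return p

# Marginal reconstruction probability: sum_y p(y|x) * p(x|y).
dist = p_y_given_x(words)
marginal = sum(py * p_x_given_y(words, y) for y, py in dist.items())
best = max(dist, key=dist.get)  # most probable latent sequence
print(best, round(marginal, 4))
```

In a real implementation the enumeration over label sequences would be replaced by the forward algorithm over the chain structure, which is what makes the approach efficient.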
Entropy and Graph Based Modelling of Document Coherence using Discourse Entities: An Application
We present two novel models of document coherence and their application to
information retrieval (IR). Both models approximate document coherence using
discourse entities, e.g. the subject or object of a sentence. Our first model
views text as a Markov process generating sequences of discourse entities
(entity n-grams); we use the entropy of these entity n-grams to approximate the
rate at which new information appears in text, reasoning that as more new words
appear, the topic increasingly drifts and text coherence decreases. Our second
model extends the work of Guinaudeau & Strube [28] that represents text as a
graph of discourse entities, linked by different relations, such as their
distance or adjacency in text. We use several graph topology metrics to
approximate different aspects of the discourse flow that can indicate
coherence, such as the average clustering or betweenness of discourse entities
in text. Experiments with several instantiations of these models show that: (i)
our models perform on a par with two other well-known models of text coherence
even without any parameter tuning, and (ii) reranking retrieval results
according to their coherence scores gives notable performance gains, confirming
a relation between document coherence and relevance. This work contributes two
novel models of document coherence, the application of which to IR complements
recent work in the integration of document cohesiveness or comprehensibility to
ranking [5, 56].
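Both coherence signals described above can be sketched in a few lines: the entropy of entity n-grams (here bigrams) as a proxy for the rate of new information, and a topology metric (average clustering) over a graph of discourse entities. The example document, the choice of linking entities in the same or adjacent sentences, and the restriction to one metric are simplifying assumptions for illustration, not the authors' exact formulation.

```python
import math
from collections import Counter
from itertools import combinations

# Illustrative document as a list of sentences, each already reduced to
# its discourse entities (e.g. subjects/objects) -- an assumed input.
doc = [
    ["court", "case"],
    ["case", "judge"],
    ["judge", "verdict"],
    ["verdict", "case"],
]

# --- Model 1: entropy of entity n-grams (bigrams here) -------------------
# Flatten to one entity sequence; higher bigram entropy suggests faster
# topic drift, i.e. lower coherence.
seq = [e for sent in doc for e in sent]
bigrams = Counter(zip(seq, seq[1:]))
total = sum(bigrams.values())
entropy = -sum((c / total) * math.log2(c / total) for c in bigrams.values())

# --- Model 2: graph of discourse entities --------------------------------
# Link entities appearing in the same or adjacent sentences, then read a
# topology metric (average clustering) off the graph as a coherence proxy.
edges = set()
for i, sent in enumerate(doc):
    window = set(sent) | (set(doc[i + 1]) if i + 1 < len(doc) else set())
    for a, b in combinations(sorted(window), 2):
        edges.add((a, b))

neigh = {}
for a, b in edges:
    neigh.setdefault(a, set()).add(b)
    neigh.setdefault(b, set()).add(a)

def clustering(v):
    """Fraction of v's neighbour pairs that are themselves linked."""
    nb = neigh[v]
    if len(nb) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(nb), 2) if (a, b) in edges)
    return links / (len(nb) * (len(nb) - 1) / 2)

avg_clust = sum(clustering(v) for v in neigh) / len(neigh)
print(round(entropy, 3), round(avg_clust, 3))
```

Other topology metrics mentioned in the abstract, such as betweenness, can be computed over the same entity graph with a standard graph library.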
Statistical semantic processing using Markov logic
Markov Logic (ML) is a novel approach to Natural Language Processing tasks
[Richardson and Domingos, 2006; Riedel, 2008]. It is a Statistical Relational Learning
language based on First Order Logic (FOL) and Markov Networks (MN). It allows
one to treat a task as structured classification. In this work, we investigate ML for the
semantic processing tasks of Spoken Language Understanding (SLU) and Semantic
Role Labelling (SRL). Both tasks consist of identifying a semantic representation for
the meaning of a given utterance/sentence. However, they differ in nature: SLU is in
the field of dialogue systems where the domain is closed and language is spoken [He
and Young, 2005], while SRL is for open domains and traditionally for written text
[Márquez et al., 2008].
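The core semantics of Markov Logic can be sketched concretely: weighted first-order formulas are grounded over a domain, and a possible world has probability P(world) proportional to exp of the weighted count of satisfied groundings. The sketch below uses the standard smoking/cancer illustration over a single constant with hand-picked weights; the domain, formulas, and weights are assumptions for illustration and are unrelated to the SLU/SRL models of this thesis.

```python
import itertools
import math

# Ground atoms over one constant A (a minimal illustrative domain).
atoms = ["Smokes(A)", "Cancer(A)"]

def n_implication(world):
    """Satisfied groundings of Smokes(x) => Cancer(x) (only x = A here)."""
    return 1 if (not world["Smokes(A)"]) or world["Cancer(A)"] else 0

def n_smokes(world):
    """Satisfied groundings of Smokes(x)."""
    return 1 if world["Smokes(A)"] else 0

# Weighted formulas: (weight, satisfied-grounding counter).
formulas = [(1.5, n_implication), (0.5, n_smokes)]

# Enumerate all possible worlds and apply P(world) ∝ exp(sum_i w_i n_i).
worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]
scores = [math.exp(sum(w * n(world) for w, n in formulas)) for world in worlds]
Z = sum(scores)
probs = [s / Z for s in scores]

# Query P(Cancer(A) | Smokes(A)) by summing over consistent worlds.
num = sum(p for world, p in zip(worlds, probs)
          if world["Smokes(A)"] and world["Cancer(A)"])
den = sum(p for world, p in zip(worlds, probs) if world["Smokes(A)"])
print(round(num / den, 3))
```

With these weights the query evaluates to e^2 / (e^2 + e^0.5), about 0.818; real Markov Logic engines avoid this exponential enumeration with approximate inference over the ground Markov network.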
Robust SLU is a key component of spoken dialogue systems. This component consists
of identifying the meaning of the user utterances addressed to the system. Recent
statistical approaches to SLU depend on additional resources (e.g., gazetteers, grammars, syntactic treebanks), which are expensive and time-consuming to produce and
maintain. On the other hand, simple datasets annotated only with slot-values are commonly
used in dialogue system development, and are easy to collect, automatically
annotate, and update. However, slot-values leave out some of the fine-grained long-distance dependencies present in richer semantic representations. In this work we investigate the development of SLU modules that require minimal resources and use slot-values as their semantic representation. We propose to use ML to capture long-distance dependencies that are not explicitly available in the slot-value semantic representation.
We test the adequacy of the ML framework by comparing it against a set of baselines that use state-of-the-art approaches to semantic processing. The results of this research have been published in Meza-Ruiz et al. [2008a,b].
Furthermore, we address the question of scalability of the ML approach for other
NLP tasks involving the identification of semantic representations. In particular, we
focus on SRL: the task of identifying predicates and arguments within sentences, together
with their semantic roles. The semantic representation built during SRL is more complex than the slot-values used in dialogue systems, in the sense that it includes the notion of predicate/argument scope. SRL is defined in the context of open domains under the premise that several levels of extra resources (lemmas, POS tags, constituent or dependency parses) are available. In this work, we propose an ML model of SRL and experiment with the different architectures that can be described for the model, which gives us insight into the types of correlations the ML model can express [Riedel and Meza-Ruiz, 2008; Meza-Ruiz and Riedel, 2009].
Additionally, we tested our minimal-resources setup in a state-of-the-art dialogue system: the TownInfo system. In this case, we were given a small dataset of gold-standard, system-dependent semantic representations, and we rapidly developed an SLU module used in the functioning dialogue system. No extra resources were necessary to reach state-of-the-art results.
Ontology population for open-source intelligence: A GATE-based solution
Open-Source INTelligence is intelligence based on publicly available sources such as news sites, blogs, forums, etc. The Web is the primary source of information, but once data are crawled, they need to be interpreted and structured. Ontologies may play a crucial role in this process, but because of the vast amount of documents available, automatic mechanisms for their population are needed, starting from the crawled text. This paper presents an approach for the automatic population of predefined ontologies with data extracted from text and discusses the design and realization of a pipeline based on the General Architecture for Text Engineering system, which is interesting for both researchers and practitioners in the field. Some experimental results that are encouraging in terms of correctly extracted ontology instances are also reported. Furthermore, the paper describes an alternative approach and provides additional experiments for one of the phases of our pipeline, which requires the use of predefined dictionaries for relevant entities. Through such a variant, the manual workload required in this phase was reduced while still obtaining promising results.
- …