Inducing Features of Random Fields
We present a technique for constructing random fields from a set of training
samples. The learning paradigm builds increasingly complex fields by allowing
potential functions, or features, that are supported by increasingly large
subgraphs. Each feature has a weight that is trained by minimizing the
Kullback-Leibler divergence between the model and the empirical distribution of
the training data. A greedy algorithm determines how features are incrementally
added to the field and an iterative scaling algorithm is used to estimate the
optimal values of the weights.
The statistical modeling techniques introduced in this paper differ from
those common to much of the natural language processing literature since there
is no probabilistic finite state or push-down automaton on which the model is
built. Our approach also differs from the techniques common to the computer
vision literature in that the underlying random fields are non-Markovian and
have a large number of parameters that must be estimated. Relations to other
learning approaches including decision trees and Boltzmann machines are given.
As a demonstration of the method, we describe its application to the problem of
automatic word classification in natural language processing.
Key words: random field, Kullback-Leibler divergence, iterative scaling,
divergence geometry, maximum entropy, EM algorithm, statistical learning,
clustering, word morphology, natural language processing
Comment: 34 pages, compressed postscript
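As a rough illustration of the weight-training step described in this abstract, the sketch below fits the weights of a tiny log-linear model with generalized iterative scaling (GIS) and tracks the Kullback-Leibler divergence from the empirical distribution. GIS is a simpler relative of iterative scaling and is swapped in here for brevity; it is not claimed to be the paper's exact algorithm. The four-outcome space, the two binary features, the slack feature, and the empirical probabilities are all invented for the example.

```python
import math

# Toy setup (all values are assumptions made for illustration).
OUTCOMES = [0, 1, 2, 3]
EMPIRICAL = [0.4, 0.3, 0.2, 0.1]
C = 2.0  # GIS requires the features of every outcome to sum to a constant


def features(x):
    """Two binary features plus a slack feature so that the total is always C."""
    f0 = 1.0 if x in (0, 1) else 0.0
    f1 = 1.0 if x in (0, 2) else 0.0
    return [f0, f1, C - f0 - f1]


def model_distribution(weights):
    """p(x) proportional to exp(sum_i w_i * f_i(x)), normalized over OUTCOMES."""
    scores = [math.exp(sum(w * f for w, f in zip(weights, features(x))))
              for x in OUTCOMES]
    z = sum(scores)
    return [s / z for s in scores]


def expectations(dist):
    """Expected value of each feature under the distribution dist."""
    return [sum(p * features(x)[i] for p, x in zip(dist, OUTCOMES))
            for i in range(3)]


def kl_divergence(p, q):
    """D(p || q) for two distributions over the same finite outcome list."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def fit_gis(iterations=500):
    """GIS update: w_i += (1/C) * log(empirical E[f_i] / model E[f_i])."""
    weights = [0.0, 0.0, 0.0]
    target = expectations(EMPIRICAL)
    for _ in range(iterations):
        current = expectations(model_distribution(weights))
        weights = [w + math.log(t / c) / C
                   for w, t, c in zip(weights, target, current)]
    return weights


if __name__ == "__main__":
    fitted = model_distribution(fit_gis())
    print("KL before:", kl_divergence(EMPIRICAL, model_distribution([0.0] * 3)))
    print("KL after: ", kl_divergence(EMPIRICAL, fitted))
```

The divergence does not reach zero here because two features cannot represent every distribution over four outcomes; adding features, as the greedy induction step in the abstract does, is what would let the model move closer.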
Memory-Based Learning: Using Similarity for Smoothing
This paper analyses the relation between the use of similarity in
Memory-Based Learning and the notion of backed-off smoothing in statistical
language modeling. We show that the two approaches are closely related, and we
argue that feature weighting methods in the Memory-Based paradigm can offer the
advantage of automatically specifying a suitable domain-specific hierarchy
between most specific and most general conditioning information without the
need for a large number of parameters. We report two applications of this
approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art
performance in both domains, and allows the easy integration of diverse
information sources, such as rich lexical representations.
Comment: 8 pages, uses aclap.sty, To appear in Proc. ACL/EACL 9
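The link between similarity and back-off in this abstract can be sketched with a weighted-overlap nearest-neighbour classifier. A mismatch on a heavily weighted feature costs more, so an unseen instance is matched to stored examples that agree on the most informative features, much as a back-off model falls back to less specific conditioning. The stored examples, the feature order (verb, noun, preposition, noun), and the weights below are made up for illustration; in practice the weights would come from a measure such as information gain.

```python
from collections import Counter


def weighted_overlap_distance(a, b, weights):
    """Sum the weights of the feature positions where a and b disagree."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(instance, memory, weights, k=1):
    """Return the majority label among the k stored examples nearest to instance."""
    ranked = sorted(
        memory,
        key=lambda example: weighted_overlap_distance(instance, example[0], weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical PP-attachment memory: (verb, noun1, preposition, noun2) -> attachment.
MEMORY = [
    (("eat", "pizza", "with", "fork"), "verb"),
    (("eat", "pizza", "with", "anchovies"), "noun"),
    (("see", "man", "with", "telescope"), "verb"),
    (("buy", "book", "of", "poems"), "noun"),
]

# Hypothetical feature weights; the preposition is treated as most informative.
WEIGHTS = (0.3, 0.3, 1.0, 0.5)

if __name__ == "__main__":
    print(classify(("read", "book", "of", "stories"), MEMORY, WEIGHTS))
    print(classify(("eat", "salad", "with", "fork"), MEMORY, WEIGHTS))
```

Neither query occurs in the memory, yet each still receives a label from the closest partial match, which is the smoothing effect the abstract describes.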
Probabilistic Constraint Logic Programming
This paper addresses two central problems for probabilistic processing
models: parameter estimation from incomplete data and efficient retrieval of
most probable analyses. These questions have been answered satisfactorily only
for probabilistic regular and context-free models. We address these problems
for a more expressive probabilistic constraint logic programming model. We
present a log-linear probability model for probabilistic constraint logic
programming. On top of this model we define an algorithm to estimate the
parameters and to select the properties of log-linear models from incomplete
data. This algorithm is an extension of the improved iterative scaling
algorithm of Della-Pietra, Della-Pietra, and Lafferty (1995). Our algorithm
applies to log-linear models in general and is accompanied by suitable
approximation methods when applied to large data spaces. Furthermore, we
present an approach for searching for most probable analyses of the
probabilistic constraint logic programming model. This method can be applied to
the ambiguity resolution problem in natural language processing applications.
Comment: 35 pages, uses sfbart.cl
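To make the log-linear scoring in this abstract concrete, here is a minimal sketch that assigns probabilities to a handful of competing analyses and picks the most probable one. It enumerates the candidates by brute force, which is only feasible for tiny candidate sets and is not the search method of the paper. The property names, weights, and candidate analyses are hypothetical.

```python
import math


def score(properties, weights):
    """Unnormalized log-score: weighted sum of the property values of one analysis."""
    return sum(weights.get(name, 0.0) * value for name, value in properties.items())


def most_probable(analyses, weights):
    """Return the best analysis label and the normalized probability of each candidate."""
    scores = {label: score(props, weights) for label, props in analyses.items()}
    z = sum(math.exp(s) for s in scores.values())
    probabilities = {label: math.exp(s) / z for label, s in scores.items()}
    # The argmax needs only the unnormalized scores; z matters for probabilities alone.
    best = max(scores, key=scores.get)
    return best, probabilities


# Hypothetical property weights and two competing analyses of one ambiguous input.
WEIGHTS = {"attach_verb": 0.5, "attach_noun": 1.2, "long_dependency": -0.7}
ANALYSES = {
    "A": {"attach_verb": 1, "long_dependency": 1},
    "B": {"attach_noun": 1},
}

if __name__ == "__main__":
    best, probabilities = most_probable(ANALYSES, WEIGHTS)
    print(best, probabilities)
```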
The Measure of a Model
This paper describes measures for evaluating the three determinants of how
well a probabilistic classifier performs on a given test set. These
determinants are the appropriateness, for the test set, of the results of (1)
feature selection, (2) formulation of the parametric form of the model, and (3)
parameter estimation. These are part of any model formulation procedure, even
if not broken out as separate steps, so the tradeoffs explored in this paper
are relevant to a wide variety of methods. The measures are demonstrated in a
large experiment, in which they are used to analyze the results of roughly 300
classifiers that perform word-sense disambiguation.
Comment: 12 pages, uuencoded compressed postscript file
Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation
Existing approaches to automatic VerbNet-style verb classification are
heavily dependent on feature engineering and therefore limited to languages
with mature NLP pipelines. In this work, we propose a novel cross-lingual
transfer method for inducing VerbNets for multiple languages. To the best of
our knowledge, this is the first study which demonstrates how the architectures
for learning word embeddings can be applied to this challenging
syntactic-semantic task. Our method uses cross-lingual translation pairs to tie
each of the six target languages into a bilingual vector space with English,
jointly specialising the representations to encode the relational information
from English VerbNet. A standard clustering algorithm is then run on top of the
VerbNet-specialised representations, using vector dimensions as features for
learning verb classes. Our results show that the proposed cross-lingual
transfer approach sets new state-of-the-art verb classification performance
across all six target languages explored in this work.
Comment: EMNLP 2017 (long paper)
Text Segmentation Using Exponential Models
This paper introduces a new statistical approach to partitioning text
automatically into coherent segments. Our approach enlists both short-range and
long-range language models to help it sniff out likely sites of topic changes
in text. To aid its search, the system consults a set of simple lexical hints
it has learned to associate with the presence of boundaries through inspection
of a large corpus of annotated data. We also propose a new probabilistically
motivated error metric for use by the natural language processing and
information retrieval communities, intended to supersede precision and recall
for appraising segmentation algorithms. Qualitative assessment of our algorithm
as well as evaluation using this new metric demonstrate the effectiveness of
our approach in two very different domains, Wall Street Journal articles and
the TDT Corpus, a collection of newswire articles and broadcast news
transcripts.
Comment: 12 pages, LaTeX source and postscript figures for EMNLP-2 paper
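The abstract does not define its error metric, so the following is only a sketch of one simplified, fixed-window form of a probabilistic segmentation error: the fraction of position pairs a fixed distance apart on which the reference and the hypothesis disagree about whether the two positions fall in the same segment. The per-position segment-label representation, the function name, and the window size `k` are assumptions made for the example.

```python
def windowed_segmentation_error(reference, hypothesis, k):
    """Fraction of pairs (i, i + k) whose same-segment status differs.

    reference and hypothesis are equal-length lists that give a segment
    label for every position (for example, every sentence) in the text.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same positions")
    pairs = len(reference) - k
    if k < 1 or pairs < 1:
        raise ValueError("k must be at least 1 and smaller than the text length")
    errors = 0
    for i in range(pairs):
        same_in_reference = reference[i] == reference[i + k]
        same_in_hypothesis = hypothesis[i] == hypothesis[i + k]
        if same_in_reference != same_in_hypothesis:
            errors += 1
    return errors / pairs


if __name__ == "__main__":
    reference = [0, 0, 0, 1, 1, 1]   # true boundary after position 2
    hypothesis = [0, 0, 0, 0, 1, 1]  # predicted boundary one position late
    print(windowed_segmentation_error(reference, hypothesis, k=2))
```

Unlike precision and recall on exact boundary positions, a measure of this kind gives partial credit to a boundary that is placed close to the true one.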
Domain Adaptation for Statistical Classifiers
The most basic assumption used in statistical learning theory is that
training data and test data are drawn from the same underlying distribution.
Unfortunately, in many applications, the "in-domain" test data is drawn from a
distribution that is related, but not identical, to the "out-of-domain"
distribution of the training data. We consider the common case in which labeled
out-of-domain data is plentiful, but labeled in-domain data is scarce. We
introduce a statistical formulation of this problem in terms of a simple
mixture model and present an instantiation of this framework to maximum entropy
classifiers and their linear chain counterparts. We present efficient inference
algorithms for this special case based on the technique of conditional
expectation maximization. Our experimental results show that our approach leads
to improved performance on three real-world tasks on four different data sets
from the natural language processing domain.