932 research outputs found
An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities
We describe an extension of Earley's parser for stochastic context-free
grammars that computes the following quantities given a stochastic context-free
grammar and an input string: a) probabilities of successive prefixes being
generated by the grammar; b) probabilities of substrings being generated by the
nonterminals, including the entire string being generated by the grammar; c)
most likely (Viterbi) parse of the string; d) posterior expected number of
applications of each grammar production, as required for reestimating rule
probabilities. (a) and (b) are computed incrementally in a single left-to-right
pass over the input. Our algorithm compares favorably to standard bottom-up
parsing methods for SCFGs in that it works efficiently on sparse grammars by
making use of Earley's top-down control structure. It can process any
context-free rule format without conversion to some normal form, and combines
computations for (a) through (d) in a single algorithm. Finally, the algorithm
has simple extensions for processing partially bracketed inputs, and for
finding partial parses and their likelihoods on ungrammatical inputs.Comment: 45 pages. Slightly shortened version to appear in Computational
Linguistics 2
An improved parser for data-oriented lexical-functional analysis
We present an LFG-DOP parser which uses fragments from LFG-annotated
sentences to parse new sentences. Experiments with the Verbmobil and Homecentre
corpora show that (1) Viterbi n best search performs about 100 times faster
than Monte Carlo search while both achieve the same accuracy; (2) the DOP
hypothesis which states that parse accuracy increases with increasing fragment
size is confirmed for LFG-DOP; (3) LFG-DOP's relative frequency estimator
performs worse than a discounted frequency estimator; and (4) LFG-DOP
significantly outperforms Tree-DOP is evaluated on tree structures only.Comment: 8 page
Probabilistic Constraint Logic Programming
This paper addresses two central problems for probabilistic processing
models: parameter estimation from incomplete data and efficient retrieval of
most probable analyses. These questions have been answered satisfactorily only
for probabilistic regular and context-free models. We address these problems
for a more expressive probabilistic constraint logic programming model. We
present a log-linear probability model for probabilistic constraint logic
programming. On top of this model we define an algorithm to estimate the
parameters and to select the properties of log-linear models from incomplete
data. This algorithm is an extension of the improved iterative scaling
algorithm of Della-Pietra, Della-Pietra, and Lafferty (1995). Our algorithm
applies to log-linear models in general and is accompanied with suitable
approximation methods when applied to large data spaces. Furthermore, we
present an approach for searching for most probable analyses of the
probabilistic constraint logic programming model. This method can be applied to
the ambiguity resolution problem in natural language processing applications.Comment: 35 pages, uses sfbart.cl
Data-Oriented Language Processing. An Overview
During the last few years, a new approach to language processing has started
to emerge, which has become known under various labels such as "data-oriented
parsing", "corpus-based interpretation", and "tree-bank grammar" (cf. van den
Berg et al. 1994; Bod 1992-96; Bod et al. 1996a/b; Bonnema 1996; Charniak
1996a/b; Goodman 1996; Kaplan 1996; Rajman 1995a/b; Scha 1990-92; Sekine &
Grishman 1995; Sima'an et al. 1994; Sima'an 1995-96; Tugwell 1995). This
approach, which we will call "data-oriented processing" or "DOP", embodies the
assumption that human language perception and production works with
representations of concrete past language experiences, rather than with
abstract linguistic rules. The models that instantiate this approach therefore
maintain large corpora of linguistic representations of previously occurring
utterances. When processing a new input utterance, analyses of this utterance
are constructed by combining fragments from the corpus; the
occurrence-frequencies of the fragments are used to estimate which analysis is
the most probable one.
In this paper we give an in-depth discussion of a data-oriented processing
model which employs a corpus of labelled phrase-structure trees. Then we review
some other models that instantiate the DOP approach. Many of these models also
employ labelled phrase-structure trees, but use different criteria for
extracting fragments from the corpus or employ different disambiguation
strategies (Bod 1996b; Charniak 1996a/b; Goodman 1996; Rajman 1995a/b; Sekine &
Grishman 1995; Sima'an 1995-96); other models use richer formalisms for their
corpus annotations (van den Berg et al. 1994; Bod et al., 1996a/b; Bonnema
1996; Kaplan 1996; Tugwell 1995).Comment: 34 pages, Postscrip
Synchronization recovery and state model reduction for soft decoding of variable length codes
Variable length codes exhibit de-synchronization problems when transmitted
over noisy channels. Trellis decoding techniques based on Maximum A Posteriori
(MAP) estimators are often used to minimize the error rate on the estimated
sequence. If the number of symbols and/or bits transmitted are known by the
decoder, termination constraints can be incorporated in the decoding process.
All the paths in the trellis which do not lead to a valid sequence length are
suppressed. This paper presents an analytic method to assess the expected error
resilience of a VLC when trellis decoding with a sequence length constraint is
used. The approach is based on the computation, for a given code, of the amount
of information brought by the constraint. It is then shown that this quantity
as well as the probability that the VLC decoder does not re-synchronize in a
strict sense, are not significantly altered by appropriate trellis states
aggregation. This proves that the performance obtained by running a
length-constrained Viterbi decoder on aggregated state models approaches the
one obtained with the bit/symbol trellis, with a significantly reduced
complexity. It is then shown that the complexity can be further decreased by
projecting the state model on two state models of reduced size
Re-estimation of Lexical Parameters for Treebank PCFGs
We present procedures which pool lexical information estimated from unlabeled data via the Inside-Outside algorithm, with lexical information from a treebank PCFG. The procedures produce substantial improvements (up to 31.6 % error reduction) on the task of determining subcategorization frames of novel verbs, relative to a smoothed Penn Treebank-trained PCFG. Even with relatively small quantities of unlabeled training data, the re-estimated models show promising improvements in labeled bracketing f-scores on Wall Street Journal parsing, and substantial benefit in acquiring the subcategorization preferences of low-frequency verbs.
- …