8,160 research outputs found
Toric grammars: a new statistical approach to natural language modeling
We propose a new statistical model for computational linguistics. Rather than
trying to estimate directly the probability distribution of a random sentence
of the language, we define a Markov chain on finite sets of sentences with many
finite recurrent communicating classes and define our language model as the
invariant probability measures of the chain on each recurrent communicating
class. This Markov chain, that we call a communication model, recombines at
each step randomly the set of sentences forming its current state, using some
grammar rules. When the grammar rules are fixed and known in advance instead of
being estimated on the fly, we can prove supplementary mathematical properties.
In particular, we can prove in this case that all states are recurrent states,
so that the chain defines a partition of its state space into finite recurrent
communicating classes. We show that our approach is a decisive departure from
Markov models at the sentence level and discuss its relationships with Context
Free Grammars. Although the toric grammars we use are closely related to
Context Free Grammars, the way we generate the language from the grammar is
qualitatively different. Our communication model has two purposes. On the one
hand, it is used to define indirectly the probability distribution of a random
sentence of the language. On the other hand it can serve as a (crude) model of
language transmission from one speaker to another speaker through the
communication of a (large) set of sentences
Stochastic Attribute-Value Grammars
Probabilistic analogues of regular and context-free grammars are well-known
in computational linguistics, and currently the subject of intensive research.
To date, however, no satisfactory probabilistic analogue of attribute-value
grammars has been proposed: previous attempts have failed to define a correct
parameter-estimation algorithm.
In the present paper, I define stochastic attribute-value grammars and give a
correct algorithm for estimating their parameters. The estimation algorithm is
adapted from Della Pietra, Della Pietra, and Lafferty (1995). To estimate model
parameters, it is necessary to compute the expectations of certain functions
under random fields. In the application discussed by Della Pietra, Della
Pietra, and Lafferty (representing English orthographic constraints), Gibbs
sampling can be used to estimate the needed expectations. The fact that
attribute-value grammars generate constrained languages makes Gibbs sampling
inapplicable, but I show how a variant of Gibbs sampling, the
Metropolis-Hastings algorithm, can be used instead.Comment: 23 pages, 21 Postscript figures, uses rotate.st
Context-Free Path Querying with Structural Representation of Result
Graph data model and graph databases are very popular in various areas such
as bioinformatics, semantic web, and social networks. One specific problem in
the area is a path querying with constraints formulated in terms of formal
grammars. The query in this approach is written as grammar, and paths querying
is graph parsing with respect to given grammar. There are several solutions to
it, but how to provide structural representation of query result which is
practical for answer processing and debugging is still an open problem. In this
paper we propose a graph parsing technique which allows one to build such
representation with respect to given grammar in polynomial time and space for
arbitrary context-free grammar and graph. Proposed algorithm is based on
generalized LL parsing algorithm, while previous solutions are based mostly on
CYK or Earley algorithms, which reduces time complexity in some cases.Comment: Evaluation extende
Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning
Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we are able to derive distribution-dependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization using this framework in both the supervised and unsupervised settings. In the unsupervised case, we show that the problem of minimizing empirical risk is NP-hard. We therefore suggest an approximate algorithm, similar to expectation-maximization, to minimize the empirical risk. Learning from data is central to contemporary computational linguistics. It is in common in such learning to estimate a model in a parametric family using the maximum likelihood principle. This principle applies in the supervised case (i.e., using annotate
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency.Comment: 27 page technical repor
Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech
We describe a statistical approach for modeling dialogue acts in
conversational speech, i.e., speech-act-like units such as Statement, Question,
Backchannel, Agreement, Disagreement, and Apology. Our model detects and
predicts dialogue acts based on lexical, collocational, and prosodic cues, as
well as on the discourse coherence of the dialogue act sequence. The dialogue
model is based on treating the discourse structure of a conversation as a
hidden Markov model and the individual dialogue acts as observations emanating
from the model states. Constraints on the likely sequence of dialogue acts are
modeled via a dialogue act n-gram. The statistical dialogue grammar is combined
with word n-grams, decision trees, and neural networks modeling the
idiosyncratic lexical and prosodic manifestations of each dialogue act. We
develop a probabilistic integration of speech recognition with dialogue
modeling, to improve both speech recognition and dialogue act classification
accuracy. Models are trained and evaluated using a large hand-labeled database
of 1,155 conversations from the Switchboard corpus of spontaneous
human-to-human telephone speech. We achieved good dialogue act labeling
accuracy (65% based on errorful, automatically recognized words and prosody,
and 71% based on word transcripts, compared to a chance baseline accuracy of
35% and human accuracy of 84%) and a small reduction in word recognition error.Comment: 35 pages, 5 figures. Changes in copy editing (note title spelling
changed
- …