Toric grammars: a new statistical approach to natural language modeling
We propose a new statistical model for computational linguistics. Rather than
trying to estimate directly the probability distribution of a random sentence
of the language, we define a Markov chain on finite sets of sentences with many
finite recurrent communicating classes and define our language model as the
invariant probability measures of the chain on each recurrent communicating
class. This Markov chain, which we call a communication model, randomly
recombines at each step the set of sentences forming its current state, using some
grammar rules. When the grammar rules are fixed and known in advance instead of
being estimated on the fly, we can prove supplementary mathematical properties.
In particular, we can prove in this case that all states are recurrent states,
so that the chain defines a partition of its state space into finite recurrent
communicating classes. We show that our approach is a decisive departure from
Markov models at the sentence level and discuss its relationships with Context
Free Grammars. Although the toric grammars we use are closely related to
Context Free Grammars, the way we generate the language from the grammar is
qualitatively different. Our communication model has two purposes. On the one
hand, it is used to define indirectly the probability distribution of a random
sentence of the language. On the other hand, it can serve as a (crude) model of
language transmission from one speaker to another speaker through the
communication of a (large) set of sentences.
CHR as grammar formalism. A first report
Grammars written as Constraint Handling Rules (CHR) can be executed as
efficient and robust bottom-up parsers that provide a straightforward,
non-backtracking treatment of ambiguity. Abduction with integrity constraints
as well as other dynamic hypothesis generation techniques fit naturally into
such grammars and are exemplified for anaphora resolution, coordination and
text interpretation.
Comment: 12 pages. Presented at ERCIM Workshop on Constraints, Prague, Czech
Republic, June 18-20, 200
On vocabulary size of grammar-based codes
We discuss inequalities holding between the vocabulary size, i.e., the number
of distinct nonterminal symbols in a grammar-based compression for a string,
and the excess length of the respective universal code, i.e., the code-based
analog of algorithmic mutual information. The aim is to strengthen inequalities
which were discussed in a weaker form in linguistics but shed some light on
redundancy of efficiently computable codes. The main contribution of the paper
is a construction of universal grammar-based codes for which the excess lengths
can be bounded easily.
Comment: 5 pages, accepted to ISIT 2007 and correcte
Formal Properties of XML Grammars and Languages
XML documents are described by a document type definition (DTD). An
XML-grammar is a formal grammar that captures the syntactic features of a DTD.
We investigate properties of this family of grammars. We show that every
XML-language basically has a unique XML-grammar. We give two characterizations
of languages generated by XML-grammars: one is set-theoretic, the other is by a
kind of saturation property. We investigate decidability problems and prove
that some properties that are undecidable for general context-free languages
become decidable for XML-languages. We also characterize those XML-grammars
that generate regular XML-languages.
Comment: 24 page
CHR Grammars
A grammar formalism based upon CHR is proposed analogously to the way
Definite Clause Grammars are defined and implemented on top of Prolog. These
grammars execute as robust bottom-up parsers with an inherent treatment of
ambiguity and a high flexibility to model various linguistic phenomena. The
formalism extends previous logic programming based grammars with a form of
context-sensitive rules and the possibility to include extra-grammatical
hypotheses in both head and body of grammar rules. Among the applications are
straightforward implementations of Assumption Grammars and abduction under
integrity constraints for language analysis. CHR grammars appear as a powerful
tool for specification and implementation of language processors and may be
proposed as a new standard for bottom-up grammars in logic programming.
To appear in Theory and Practice of Logic Programming (TPLP), 2005.
Comment: 36 pp.
On Hilberg's Law and Its Links with Guiraud's Law
Hilberg (1990) supposed that finite-order excess entropy of a random human
text is proportional to the square root of the text length. Assuming that
Hilberg's hypothesis is true, we derive Guiraud's law, which states that the
number of word types in a text is greater than proportional to the square root
of the text length. Our derivation is based on some mathematical conjecture in
coding theory and on several experiments suggesting that words can be defined
approximately as the nonterminals of the shortest context-free grammar for the
text. Such an operational definition of words can be applied even to texts
deprived of spaces, which do not allow for Mandelbrot's ``intermittent
silence'' explanation of Zipf's and Guiraud's laws. In contrast to
Mandelbrot's, our model assumes some probabilistic long-memory effects in human
narration and might be capable of explaining Menzerath's law.
Comment: To appear in Journal of Quantitative Linguistic
Languages, machines, and classical computation
3rd ed., 2021. A circumscription of the classical theory of computation, building up from the Chomsky hierarchy, with the usual topics in formal language and automata theory.
Message-Passing Protocols for Real-World Parsing -- An Object-Oriented Model and its Preliminary Evaluation
We argue for a performance-based design of natural language grammars and
their associated parsers in order to meet the constraints imposed by real-world
NLP. Our approach incorporates declarative and procedural knowledge about
language and language use within an object-oriented specification framework. We
discuss several message-passing protocols for parsing and provide reasons for
sacrificing completeness of the parse in favor of efficiency based on a
preliminary empirical evaluation.
Comment: 12 pages, uses epsfig.st
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency.
Comment: 27 page technical repor