Search CORE

323 research outputs found

On Hilberg's Law and Its Links with Guiraud's Law

Author: Altmann G.
Belevitch V.
Bell T. C.
Billingsley P.
Bod R.
De Marcken C. G.
Dębowski Ł.
Dębowski Ł.
Dębowski Ł.
Dębowski Ł.
Guiraud H.
Hoffmann L.
Jelinek F.
Kallenberg O.
Kornai A.
Lehman E.
Lehman E.
Li M.
Li W.
Mandelbrot B.
Mandelbrot B.
Manning C. D.
Megyesi B.
Menzerath P.
Montemurro M. A.
Nevill-Manning C.
Pareto V.
Petrova N. V.
Shalizi C. R.
Shannon C.
Upper D. R.
Wolff J. G.
Zipf G. K.
Zipf G. K.
Łukasz Dębowski
Publication venue: 'Informa UK Limited'
Publication date: 07/07/2005

Hilberg (1990) supposed that the finite-order excess entropy of a random human text is proportional to the square root of the text length. Assuming that Hilberg's hypothesis is true, we derive Guiraud's law, which states that the number of word types in a text is greater than proportional to the square root of the text length. Our derivation is based on a mathematical conjecture in coding theory and on several experiments suggesting that words can be defined approximately as the nonterminals of the shortest context-free grammar for the text. Such an operational definition of words can be applied even to texts deprived of spaces, which do not allow for Mandelbrot's "intermittent silence" explanation of Zipf's and Guiraud's laws. In contrast to Mandelbrot's, our model assumes some probabilistic long-memory effects in human narration and might be capable of explaining Menzerath's law.
Comment: To appear in Journal of Quantitative Linguistics
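As a rough illustration of the quantity Guiraud's law is about, the sketch below counts word types V in growing prefixes of a text of length n and reports V / sqrt(n). The function name `type_token_profile`, the whitespace tokenization, and the toy sentence are assumptions made for this example; the paper itself defines words through grammar nonterminals, which this sketch does not attempt.

```python
import math

def type_token_profile(tokens, checkpoints):
    """Return (n, V, V / sqrt(n)) for each prefix length n in checkpoints."""
    seen = set()
    wanted = set(checkpoints)
    rows = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)  # a type is counted once, however often it recurs
        if n in wanted:
            rows.append((n, len(seen), len(seen) / math.sqrt(n)))
    return rows

# Toy input (an assumption for illustration, not data from the paper).
text = "the cat sat on the mat and the dog sat on the log"
profile = type_token_profile(text.split(), checkpoints=[4, 9, 13])
for n, v, ratio in profile:
    print(n, v, round(ratio, 3))
```

On a real corpus one would inspect how the ratio behaves as n grows; a toy sentence this short says nothing about the law itself.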

arXiv.org e-Print Archive

Crossref

Cumulative subject index

Publication venue: 'Elsevier BV'
Publication date: 31/12/1978

Elsevier - Publisher Connector

Learning Language from a Large (Unannotated) Corpus

Author: Goertzel Ben
Vepstas Linas
Publication date: 14/01/2014

A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus.
Comment: 29 pages, 5 figures, research proposal

arXiv.org e-Print Archive

CiteSeerX

Global Thresholding and Multiple Pass Parsing

Author: Goodman Joshua
Publication date: 01/01/1997

We present a variation on classic beam thresholding techniques that is up to an order of magnitude faster than the traditional method, at the same performance level. We also present a new thresholding technique, global thresholding, which, combined with the new beam thresholding, gives an additional factor of two improvement, and a novel technique, multiple pass parsing, that can be combined with the others to yield yet another 50% improvement. We use a new search algorithm to simultaneously optimize the thresholding parameters of the various algorithms.
Comment: Fixed latex errors; fixed minor errors in published version
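For orientation, classic beam thresholding in a chart parser is commonly described as dropping the entries in a cell whose score falls below some fraction of the best score in that cell. The sketch below shows only that generic baseline; it is not the variation, the global thresholding, or the multiple pass technique from the abstract. The name `beam_prune`, the dictionary representation of a cell, and the probabilities are all assumptions for illustration.

```python
def beam_prune(cell, beam=1e-3):
    """Keep entries whose score is at least `beam` times the best score in the cell."""
    if not cell:
        return {}
    best = max(cell.values())
    return {label: p for label, p in cell.items() if p >= beam * best}

# Hypothetical chart cell: nonterminal label -> probability of the best analysis.
cell = {"NP": 0.02, "VP": 0.00001, "S": 0.004}
pruned = beam_prune(cell, beam=0.01)
print(sorted(pruned))
```

A wider beam (smaller `beam` value) keeps more entries and costs more time; a narrower one is faster but risks discarding the correct analysis.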

arXiv.org e-Print Archive

CiteSeerX

Computation of distances for regular and context-free probabilistic languages

Author: Nederhof Mark Jan
Satta Giorgio
Publication date: 01/01/2008

Several mathematical distances between probabilistic languages have been investigated in the literature, motivated by applications in language modeling, computational biology, syntactic pattern matching and machine learning. In most cases, only pairs of probabilistic regular languages were considered. In this paper we extend the previous results to pairs of languages generated by a probabilistic context-free grammar and a probabilistic finite automaton.
Postprint. Peer reviewed

Elsevier - Publisher Connector

Crossref

Archivio istituzionale della ricerca - Università di Padova

University of St. Andrews - Pure

St Andrews Research Repository

Cumulative subject index volumes 33–35

Publication venue: Published by Elsevier Inc.

Elsevier - Publisher Connector

Stochastic Attribute-Value Grammars

Author: Abney Steven
Publication date: 23/10/1996

Probabilistic analogues of regular and context-free grammars are well known in computational linguistics, and currently the subject of intensive research. To date, however, no satisfactory probabilistic analogue of attribute-value grammars has been proposed: previous attempts have failed to define a correct parameter-estimation algorithm. In the present paper, I define stochastic attribute-value grammars and give a correct algorithm for estimating their parameters. The estimation algorithm is adapted from Della Pietra, Della Pietra, and Lafferty (1995). To estimate model parameters, it is necessary to compute the expectations of certain functions under random fields. In the application discussed by Della Pietra, Della Pietra, and Lafferty (representing English orthographic constraints), Gibbs sampling can be used to estimate the needed expectations. The fact that attribute-value grammars generate constrained languages makes Gibbs sampling inapplicable, but I show how a variant of Gibbs sampling, the Metropolis-Hastings algorithm, can be used instead.
Comment: 23 pages, 21 Postscript figures, uses rotate.st
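To make the role of Metropolis-Hastings in estimating expectations concrete, here is a minimal generic sketch on a toy discrete target. It is not the paper's sampler for attribute-value grammars: the function name `metropolis_hastings`, the five-state ring, the weights, and the fixed seed are all assumptions chosen so the example is small and reproducible.

```python
import random

def metropolis_hastings(weight, n_states, steps, rng):
    """Sample states 0..n_states-1 with probability proportional to weight(x)."""
    x = 0
    samples = []
    for _ in range(steps):
        # Symmetric proposal: step left or right on a ring of states.
        y = (x + rng.choice((-1, 1))) % n_states
        # Accept with probability min(1, weight(y) / weight(x)).
        if rng.random() < min(1.0, weight(y) / weight(x)):
            x = y
        samples.append(x)  # a rejected move repeats the current state
    return samples

weights = [1.0, 2.0, 3.0, 2.0, 1.0]  # toy unnormalized target (assumption)
rng = random.Random(0)
samples = metropolis_hastings(lambda s: weights[s], len(weights), 50_000, rng)
estimate = sum(samples) / len(samples)  # Monte Carlo estimate of E[x]
print(round(estimate, 2))
```

Only ratios of weights enter the acceptance step, so the normalizing constant of the target is never needed. For this symmetric toy target the exact expectation is 2, which gives a simple check on the estimate.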

arXiv.org e-Print Archive

CiteSeerX