Search CORE

42,517 research outputs found

On vocabulary size of grammar-based codes

Author: Debowski Lukasz
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/04/2007
Field of study

We discuss inequalities holding between the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression for a string, and the excess length of the respective universal code, i.e., the code-based analog of algorithmic mutual information. The aim is to strengthen inequalities which were discussed in a weaker form in linguistics but shed some light on redundancy of efficiently computable codes. The main contribution of the paper is a construction of universal grammar-based codes for which the excess lengths can be bounded easily.Comment: 5 pages, accepted to ISIT 2007 and correcte

arXiv.org e-Print Archive

Crossref

On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

Author: Dębowski Łukasz
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/02/2011
Field of study

The article presents a new interpretation for Zipf-Mandelbrot's law in natural language which rests on two areas of information theory. Firstly, we construct a new class of grammar-based codes and, secondly, we investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If a text of length

n

describes

n^\beta

independent facts in a repetitive way then the text contains at least

n^\beta/\log n

different words, under suitable conditions on

n

. In the formal statement, two modeling postulates are adopted. Firstly, the words are understood as nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the text is assumed to be emitted by a finite-energy strongly nonergodic source whereas the facts are binary IID variables predictable in a shift-invariant way.Comment: 24 pages, no figure

arXiv.org e-Print Archive

Crossref

On Hilberg's Law and Its Links with Guiraud's Law

Author: Altmann G.
Belevitch V.
Bell T. C.
Billingsley P.
Bod R.
De Marcken C. G.
Dębowski Ł.
Dębowski Ł.
Dębowski Ł.
Dębowski Ł.
Guiraud H.
Hoffmann L.
Jelinek F.
Kallenberg O.
Kornai A.
Lehman E.
Lehman E.
Li M.
Li W.
Mandelbrot B.
Mandelbrot B.
Manning C. D.
Megyesi B.
Menzerath P.
Montemurro M. A.
Nevill-Manning C.
Pareto V.
Petrova N. V.
Shalizi C. R.
Shannon C.
Upper D. R.
Wolff J. G.
Zipf G. K.
Zipf G. K.
Łukasz De¸bowski
Publication venue: 'Informa UK Limited'
Publication date: 07/07/2005
Field of study

Hilberg (1990) supposed that finite-order excess entropy of a random human text is proportional to the square root of the text length. Assuming that Hilberg's hypothesis is true, we derive Guiraud's law, which states that the number of word types in a text is greater than proportional to the square root of the text length. Our derivation is based on some mathematical conjecture in coding theory and on several experiments suggesting that words can be defined approximately as the nonterminals of the shortest context-free grammar for the text. Such operational definition of words can be applied even to texts deprived of spaces, which do not allow for Mandelbrot's ``intermittent silence'' explanation of Zipf's and Guiraud's laws. In contrast to Mandelbrot's, our model assumes some probabilistic long-memory effects in human narration and might be capable of explaining Menzerath's law.Comment: To appear in Journal of Quantitative Linguistic

arXiv.org e-Print Archive

Crossref

Universal Indexes for Highly Repetitive Document Collections

Author: Claude Francisco
Fariña Antonio
Martínez-Prieto Miguel A.
Navarro Gonzalo
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016
Field of study

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

arXiv.org e-Print Archive

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Académico de la Universidad de Chile

Operationalising and measuring language dominance

Author: Altarriba J.
Appel R.
Bachman L.
Bachman L.
Baetens Beardsmore H.
Baker C.
Berman R.A.
Chomsky N.
Cook V.
Dunn L.M.
Grosjean F.
Guiraud P.
Jeanine Treffers-Daller
Lanza E.
MacWhinney B.
Malvern D.D.
Mayer M.
Mayer M.
Meara P.
Myers-Scotton C.
Nation I.S.P.
Raven J.C.
Skutnab-Kangas T.
Publication venue: 'SAGE Publications'
Publication date: 01/01/2011
Field of study

The paper offers a new way to measure language ability in bilinguals, based on measures of lexical richness. The validity of proposed approach is tested in a variety of ways

Central Archive at the University of Reading

CiteSeerX

Crossref

UWE Bristol Research Repository

Detecting sequential structure

Author: Nevill-Manning Craig G.
Witten Ian H.
Publication venue
Publication date: 01/01/1995
Field of study

Programming by demonstration requires detection and analysis of sequential patterns in a user’s input, and the synthesis of an appropriate structural model that can be used for prediction. This paper describes SEQUITUR, a scheme for inducing a structural description of a sequence from a single example. SEQUITUR integrates several different inference techniques: identification of lexical subsequences or vocabulary elements, hierarchical structuring of such subsequences, identification of elements that have equivalent usage patterns, inference of programming constructs such as looping and branching, generalisation by unifying grammar rules, and the detection of procedural substructure., Although SEQUITUR operates with abstract sequences, a number of concrete illustrations are provided

Research Commons@Waikato

Rating scale development: a multistage exploratory sequential design

Author: Galaczi Evelina D.
Khabbazbashi Nahal
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/03/2016
Field of study

The project chosen to showcase the application of the exploratory sequential design in second/ foreign (L2) language assessment comes from the context of rating scale development and focuses on the development of a set of scales for a suite of high-stakes L2 speaking tests. The assessment of speaking requires assigning scores to a speech sample in a systematic fashion by focusing on explicitly defined criteria which describe different levels of performance (Ginther 2013). Rating scales are the instruments used in this evaluation process, and they can be either holistic (i.e. providing a global overall assessment) or analytic (i.e. providing an independent evaluations for a number of assessment criteria, e.g. Grammar, Vocabulary, Organisation, etc.). The discussion in this chapter is framed within the context of rating scales in speaking assessment. However, it is worth noting that the principles espoused, stages employed and decisions taken during the development process have wider applicability to performance assessment in general

University of Bedfordshire Repository

Re-Pair Compression of Inverted Lists

Author: Claude Francisco
Farina Antonio
Navarro Gonzalo
Publication venue
Publication date: 01/01/2009
Field of study

Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that in addition speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, to conclude that our Re-Pair variants offer an interesting time/space tradeoff for this problem, yet further improvements are required for it to improve upon the state of the art

arXiv.org e-Print Archive

CiteSeerX

Repositorio Académico de la Universidad de Chile