An Abstract Machine for Unification Grammars
This work describes the design and implementation of an abstract machine,
Amalia, for the linguistic formalism ALE, which is based on typed feature
structures. This formalism is one of the most widely accepted in computational
linguistics and has been used for designing grammars in various linguistic
theories, most notably HPSG. Amalia is composed of data structures and a set of
instructions, augmented by a compiler from the grammatical formalism to the
abstract instructions, and a (portable) interpreter of the abstract
instructions. The effect of each instruction is defined using a low-level
language that can be executed on ordinary hardware.
The advantages of the abstract machine approach are twofold. From a
theoretical point of view, the abstract machine gives a well-defined
operational semantics to the grammatical formalism. This ensures that grammars
specified using our system are endowed with a well-defined meaning. This makes
it possible, for example, to formally verify the correctness of a compiler for HPSG, given
an independent definition. From a practical point of view, Amalia is the first
system that employs a direct compilation scheme for unification grammars that
are based on typed feature structures. The use of Amalia results in much
improved performance over existing systems.
In order to test the machine on a realistic application, we have developed a
small-scale, HPSG-based grammar for a fragment of the Hebrew language, using
Amalia as the development platform. This is the first application of HPSG to a
Semitic language.
Comment: Doctoral Thesis, 96 pages, many postscript figures, uses pstricks, pst-node, psfig, fullname and a macros file.
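The core operation on which such typed-feature-structure formalisms rest is unification. As a rough illustration only (not Amalia's actual implementation, which involves a type hierarchy and compiled abstract-machine instructions), a minimal untyped feature-structure unifier can be sketched in Python:

```python
def unify(fs1, fs2):
    """Unify two feature structures represented as nested dicts.
    Atomic values unify only if equal; dicts unify feature by feature.
    Returns the unified structure, or None on a clash."""
    if fs1 == fs2:
        return fs1
    if not (isinstance(fs1, dict) and isinstance(fs2, dict)):
        return None  # distinct atomic values clash
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat in result:
            sub = unify(result[feat], val)
            if sub is None:
                return None  # clash propagates upward
            result[feat] = sub
        else:
            result[feat] = val
    return result

np_sign = {"cat": "np", "agr": {"num": "sg"}}
subj = {"cat": "np", "agr": {"per": "3"}}
print(unify(np_sign, subj))
# {'cat': 'np', 'agr': {'num': 'sg', 'per': '3'}}
```

The example names (`np_sign`, `subj`) and the dict encoding are hypothetical conveniences; a real system such as ALE additionally checks type compatibility at every node.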
Languages cool as they expand: Allometric scaling and the decreasing need for new words
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This "cooling pattern" forms the basis of a third statistical regularity, which, unlike the Zipf and Heaps laws, is dynamical in nature.
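The two regularities the abstract builds on, Zipf's rank-frequency law and the Heaps (allometric) relation between corpus size and vocabulary size, can be measured on any tokenized corpus. A minimal sketch in Python on a toy text (the function names are illustrative, not the authors' code):

```python
from collections import Counter

def heaps_curve(tokens):
    """Vocabulary size V(n) as the corpus grows token by token;
    Heaps' law predicts V(n) ~ n**beta with beta < 1."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve

def zipf_ranks(tokens):
    """(rank, frequency) pairs in descending frequency order;
    Zipf's law predicts frequency ~ 1/rank for the most common words."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

text = "the cat sat on the mat and the dog sat on the log".split()
print(heaps_curve(text)[-1])  # (13, 8): 13 tokens, 8 distinct words
print(zipf_ranks(text)[0])    # (1, 4): the top-ranked word occurs 4 times
```

On a real corpus one would fit the exponent of `heaps_curve` on a log-log scale; the decreasing marginal need for new words reported above corresponds to the curve flattening as the corpus grows.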
In the Beginning Was the Verb: The Emergence and Evolution of Language Problem in the Light of the Big Bang Epistemological Paradigm.
The enigma of the Emergence of Natural Languages, coupled or not with the closely related problem of their Evolution, is perceived today as one of the most important scientific problems.
The purpose of the present study is to outline a solution to this problem which is epistemologically consonant with the Big Bang solution of the problem of the Emergence of the Universe. Such an outline, however, becomes articulable, understandable, and workable only in a drastically extended epistemic and scientific oecumene, where known and habitual approaches to the problem, both theoretical and experimental, become distant, isolated, even if to some degree still hospitable, conceptual and methodological islands.
The guiding light of our inquiry will be Eugene Paul Wigner's metaphor of ``the unreasonable effectiveness of mathematics in natural sciences'', i.e., ``the miracle of the appropriateness of the language of mathematics for the formulation of the laws of physics'', which has been steadily evolving before our eyes since at least the 17th century. Kurt Goedel's incompleteness and undecidability theory will be our guardian discerner against logical fallacies of otherwise apparently plausible explanations.
John Bell's ``unspeakableness'' and the commonplace counterintuitive character of quantum phenomena will be our encouragers. And the radical novelty of the Big Bang epistemological paradigm, introduced here and adapted to our purposes, will be an appropriate, even if probably shocking, response to our equally shocking discovery, in the oldest among well-preserved linguistic fossils, of perfect mathematical structures outdoing the best artifactual Assemblers.
Linguistic constraints on statistical word segmentation: The role of consonants in Arabic and English
Statistical learning is often taken to lie at the heart of many cognitive tasks, including the acquisition of language. One particular task in which probabilistic models have achieved considerable success is the segmentation of speech into words. However, these models have mostly been tested against English data, and as a result little is known about how a statistical learning mechanism copes with input regularities that arise from the structural properties of different languages. This study focuses on statistical word segmentation in Arabic, a Semitic language in which words are built around consonantal roots. We hypothesize that segmentation in such languages is facilitated by tracking consonant distributions independently from intervening vowels. Previous studies have shown that human learners can track consonant probabilities across intervening vowels in artificial languages, but it is unknown to what extent this ability would be beneficial in the segmentation of natural language. We assessed the performance of a Bayesian segmentation model on English and Arabic, comparing consonant-only representations with full representations. In addition, we examined to what extent structurally different proto-lexicons reflect adult language. The results suggest that for a child learning a Semitic language, separating consonants from vowels is beneficial for segmentation. These findings indicate that probabilistic models require appropriate linguistic representations in order to effectively meet the challenges of language acquisition.
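The consonant-only representation hypothesized above can be illustrated by projecting an unsegmented utterance onto its consonantal tier. A rough sketch in Python (the vowel inventory and example string are illustrative; real Arabic orthography with diacritics needs more care, and this is not the Bayesian segmentation model itself):

```python
VOWELS = set("aeiou")  # toy vowel inventory; an assumption for illustration

def consonant_tier(utterance):
    """Project a transliterated, unsegmented utterance onto its
    consonantal tier, the representation hypothesized to ease
    segmentation of root-based languages such as Arabic."""
    return "".join(ch for ch in utterance if ch not in VOWELS)

# In the consonant tier, the recurring root k-t-b ('write') surfaces
# as a repeated unit even though the vowel patterns differ.
print(consonant_tier("kitabmaktab"))  # ktbmktb
```

A segmentation model operating on this tier would track transitional probabilities over consonants only, letting the recurring root support a word boundary hypothesis that the full representation obscures.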
The "handedness" of language: Directional symmetry breaking of sign usage in words
Language, which allows complex ideas to be communicated through symbolic
sequences, is a characteristic feature of our species and manifested in a
multitude of forms. Using large written corpora for many different languages
and scripts, we show that the occurrence probability distributions of signs at
the left and right ends of words have a distinct heterogeneous nature.
Characterizing this asymmetry using quantitative inequality measures, viz.
information entropy and the Gini index, we show that the beginning of a word is
less restrictive in sign usage than the end. This property is not simply
attributable to the use of common affixes as it is seen even when only word
roots are considered. We use the existence of this asymmetry to infer the
direction of writing in undeciphered inscriptions that agrees with the
archaeological evidence. Unlike traditional investigations of phonotactic
constraints which focus on language-specific patterns, our study reveals a
property valid across languages and writing systems. As both language and
writing are unique aspects of our species, this universal signature may reflect
an innate feature of the human cognitive phenomenon.
Comment: 10 pages, 4 figures + Supplementary Information (15 pages, 8 figures), final corrected version.
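The two inequality measures named above, information entropy and the Gini index, can be computed over the distributions of word-initial and word-final signs. A minimal sketch on a toy word list (the word list is illustrative, not the paper's corpora):

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits; higher means a less restricted,
    more heterogeneous sign inventory."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index of a distribution: 0 for a perfectly even
    distribution, approaching 1 as usage concentrates on few signs."""
    xs = sorted(probs)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n

def sign_distribution(words, position):
    """Probability distribution of the sign at a given position
    (0 = word-initial, -1 = word-final)."""
    counts = Counter(w[position] for w in words if w)
    total = sum(counts.values())
    return [c / total for c in counts.values()]

words = ["speak", "spin", "stack", "track", "trick", "brick"]
first = sign_distribution(words, 0)   # initial letters: s, s, s, t, t, b
last = sign_distribution(words, -1)   # final letters: k, n, k, k, k, k
print(entropy(first) > entropy(last))  # True: beginnings less restrictive
```

On this toy list the word-initial distribution has higher entropy and a lower Gini index than the word-final one, the same directional asymmetry the study reports across scripts.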
Amalia -- A Unified Platform for Parsing and Generation
Contemporary linguistic theories (in particular, HPSG) are declarative in
nature: they specify constraints on permissible structures, not how such
structures are to be computed. Grammars designed under such theories are,
therefore, suitable for both parsing and generation. However, practical
implementations of such theories do not usually support bidirectional processing
of grammars. We present a grammar development system that includes a compiler
of grammars (for parsing and generation) to abstract machine instructions, and
an interpreter for the abstract machine language. The generation compiler
inverts input grammars (designed for parsing) to a form more suitable for
generation. The compiled grammars are then executed by the interpreter using
one control strategy, regardless of whether the grammar is the original or the
inverted version. We thus obtain a unified, efficient platform for developing
reversible grammars.
Comment: 8 pages, postscript.