24 research outputs found

    The infochemical core

    Vocalizations, and less often gestures, have been the object of linguistic research for decades. However, the development of a general theory of communication with human language as a particular case requires a clear understanding of the organization of communication through other means. Infochemicals are chemical compounds that carry information and are employed by small organisms that cannot emit acoustic signals of an optimal frequency to achieve successful communication. Here, we investigate the distribution of infochemicals across species when they are ranked by their degree or the number of species with which they are associated (because they produce them or are sensitive to them). We evaluate the quality of the fit of different functions to the dependency between degree and rank by means of a penalty for the number of parameters of the function. Surprisingly, a double Zipf (a Zipf distribution with two regimes, each with a different exponent) is the model yielding the best fit although it is the function with the largest number of parameters. This suggests that the worldwide repertoire of infochemicals contains a core which is shared by many species and is reminiscent of the core vocabularies found for human language in dictionaries or large corpora.
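
    As a rough illustration of the model-selection step described above, the sketch below fits a single Zipf law and a double Zipf law (two power-law regimes with a free breakpoint) to rank-degree data and compares them with an AIC-style penalty on the number of parameters. The toy data, the independent-intercept parametrisation of the two regimes and the use of AIC are assumptions for illustration, not the authors' procedure.

        import numpy as np

        # Toy rank-degree data (hypothetical): degree of each infochemical vs. its
        # rank, generated from two power-law regimes plus lognormal noise.
        rng = np.random.default_rng(0)
        rank = np.arange(1, 201, dtype=float)
        true = np.where(rank < 30, 400 * rank ** -0.6, 6100 * rank ** -1.4)
        degree = true * rng.lognormal(0.0, 0.1, rank.size)
        log_r, log_d = np.log(rank), np.log(degree)

        def fit_single_zipf(x, y):
            # one power law: a straight line in log-log space (2 parameters)
            b, a = np.polyfit(x, y, 1)
            return a + b * x, 2

        def fit_double_zipf(x, y):
            # two power-law regimes with a free breakpoint (5 parameters)
            best = None
            for i in range(5, len(x) - 5):
                b1, a1 = np.polyfit(x[:i], y[:i], 1)
                b2, a2 = np.polyfit(x[i:], y[i:], 1)
                yhat = np.concatenate([a1 + b1 * x[:i], a2 + b2 * x[i:]])
                rss = np.sum((y - yhat) ** 2)
                if best is None or rss < best[0]:
                    best = (rss, yhat)
            return best[1], 5

        def aic(y, yhat, k):
            # penalise extra parameters: they must buy a clearly better fit
            return len(y) * np.log(np.sum((y - yhat) ** 2) / len(y)) + 2 * k

        for name, (yhat, k) in [("single Zipf", fit_single_zipf(log_r, log_d)),
                                ("double Zipf", fit_double_zipf(log_r, log_d))]:
            print(name, "AIC =", round(aic(log_d, yhat, k), 1))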

    Preprocessing Algorithm for Deciphering Historical Inscriptions Using String Metric

    The article presents improvements to the preprocessing part of the deciphering method (the preprocessing algorithm, for short) for historical inscriptions of unknown origin. Glyphs used in historical inscriptions changed over time; therefore, various versions of the same script may contain different glyphs for each grapheme. The purpose of the preprocessing algorithm is to reduce the running time of the deciphering process by filtering out the less probable interpretations of the examined inscription. However, in certain cases the first version of the preprocessing algorithm produced an incorrect outcome or no result at all. Therefore, an improved version was developed that finds the most similar words in the dictionary by specifying the search conditions more accurately while remaining computationally efficient. Moreover, a sophisticated similarity metric used to determine the possible meaning of the unknown inscription is introduced. The results of the evaluations are also detailed.
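
    The dictionary-filtering idea can be sketched with a plain Levenshtein distance over transliterated glyph sequences; the metric, the distance threshold and the word list below are illustrative assumptions, not the similarity metric proposed in the article.

        # Minimal sketch: rank dictionary words by a string metric against an
        # unknown transliterated inscription and keep only the closest candidates.
        def levenshtein(a, b):
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                  # deletion
                                   cur[j - 1] + 1,               # insertion
                                   prev[j - 1] + (ca != cb)))    # substitution
                prev = cur
            return prev[-1]

        def candidates(inscription, dictionary, max_dist=2):
            scored = [(levenshtein(inscription, w), w) for w in dictionary]
            return sorted((d, w) for d, w in scored if d <= max_dist)

        # Hypothetical transliterations, for illustration only.
        print(candidates("tprk", ["toprak", "tengri", "tarqan", "turk"]))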

    Statistical analysis of the Indus script using n-grams

    The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilisation. Recently, some researchers have questioned the premise that the Indus script encodes spoken language. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyse the Indus script for syntax. Our main results are that the script has well-defined signs which begin and end texts, that there is directionality and strong correlations in the sign order, and that there are groups of signs which appear to have identical syntactic function. All these require no a priori suppositions regarding the syntactic or semantic content of the signs, but follow directly from the statistical analysis. Using information theoretic measures, we find the information in the script to be intermediate between that of a completely random and a completely fixed ordering of signs. Our study reveals that the Indus script is a structured sign system showing features of a formal language, but, at present, cannot conclusively establish that it encodes natural language. Our n-gram Markov model is useful for predicting signs which are missing or illegible in a corpus of Indus texts. This work forms the basis for the development of a stochastic grammar which can be used to explore the syntax of the Indus script in greater detail.
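
    A minimal sketch of the kind of first-order (bigram) Markov model described above, including the prediction of a missing sign from its neighbours; the numeric sign IDs and the toy texts are invented placeholders, not the Indus corpus, and no smoothing is applied.

        from collections import Counter, defaultdict

        # Toy sign sequences (hypothetical sign IDs, not the real Indus corpus).
        texts = [[2, 7, 5, 1], [2, 7, 3, 1], [4, 7, 5, 1], [2, 3, 1]]

        # Estimate bigram (first-order Markov) transition counts with markers
        # for text beginnings and endings.
        bigrams = defaultdict(Counter)
        for t in texts:
            padded = ["<s>"] + t + ["</s>"]
            for a, b in zip(padded, padded[1:]):
                bigrams[a][b] += 1

        def predict_missing(left, right):
            # Score candidate signs x for a gap "left ? right" by P(x|left)*P(right|x).
            scores = {}
            for x, c_lx in bigrams[left].items():
                p_lx = c_lx / sum(bigrams[left].values())
                p_xr = bigrams[x][right] / max(sum(bigrams[x].values()), 1)
                scores[x] = p_lx * p_xr
            return max(scores, key=scores.get) if scores else None

        print(predict_missing(7, 1))   # most likely sign between sign 7 and sign 1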

    Constant conditional entropy and related hypotheses

    Constant entropy rate (conditional entropies must remain constant as the sequence length increases) and uniform information density (conditional probabilities must remain constant as the sequence length increases) are two information theoretic principles that are argued to underlie a wide range of linguistic phenomena. Here we revise the predictions of these principles in the light of Hilberg's law on the scaling of conditional entropy in language and related laws. We show that constant entropy rate (CER) and two interpretations for uniform information density (UID), full UID and strong UID, are inconsistent with these laws. Strong UID implies CER but the reverse is not true. Full UID, a particular case of UID, leads to costly uncorrelated sequences that are totally unrealistic. We conclude that CER and its particular cases are incomplete hypotheses about the scaling of conditional entropies.
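
    The scaling behaviour at issue can be probed empirically with block entropies: the plug-in estimate of the conditional entropy h(n) = H(n) - H(n-1) should stay flat under constant entropy rate and decay under Hilberg-type scaling. The sketch below is an illustration only; corpus.txt is a placeholder file name and the naive plug-in estimator is strongly biased for long blocks on small corpora.

        import math
        from collections import Counter

        def block_entropy(s, n):
            # plug-in entropy (bits) of the distribution of length-n blocks in s
            blocks = Counter(s[i:i + n] for i in range(len(s) - n + 1))
            total = sum(blocks.values())
            return -sum(c / total * math.log2(c / total) for c in blocks.values())

        text = open("corpus.txt").read()                    # placeholder corpus file
        H = [block_entropy(text, n) for n in range(1, 9)]   # block entropies H(1)..H(8)
        h = [H[0]] + [H[n] - H[n - 1] for n in range(1, len(H))]
        print(h)  # a flat sequence supports CER; a decaying one, Hilberg-type scaling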

    Boltzmann Complexity: An Emergent Property of the Majorization Partial Order

    Boltzmann macrostates, which are in 1:1 correspondence with the partitions of integers, are investigated. Integer partitions, unlike entropy, uniquely characterize Boltzmann states, but their use has been limited. Integer partitions are well known to be partially ordered by majorization. It is less well known that this partial order is fundamentally equivalent to the “mixedness” of the set of microstates that comprise each macrostate. Thus, integer partitions represent the fundamental property of the mixing character of Boltzmann states. The standard definition of incomparability in partial orders is applied to each partition (or macrostate) to calculate the number C of other macrostates with which it is incomparable. We show that the value of C complements the value of the Boltzmann entropy, S, obtained in the usual way. Results for C and S are obtained for Boltzmann states comprised of up to N = 50 microstates, for which there are 204,226 Boltzmann macrostates. We note that, unlike mixedness, neither C nor S uniquely characterizes macrostates. Plots of C vs. S are shown. The results are surprising and support the authors’ earlier suggestion that C be regarded as the complexity of the Boltzmann states. From this we propose that complexity may generally arise from incomparability in other systems as well.
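
    The quantities C and S described above can be reproduced for small N by direct enumeration of integer partitions and pairwise majorization checks; the sketch below assumes the standard microstate count W = N!/∏ n_i! and uses a small N for tractability, so it is an illustration rather than the authors' computation.

        from itertools import accumulate
        from math import factorial, log

        def partitions(n, max_part=None):
            # integer partitions of n as non-increasing tuples
            max_part = max_part or n
            if n == 0:
                yield ()
                return
            for k in range(min(n, max_part), 0, -1):
                for rest in partitions(n - k, k):
                    yield (k,) + rest

        def majorizes(p, q, n):
            # p majorizes q if every partial sum of p dominates that of q
            pp = list(accumulate(p)) + [n] * (n - len(p))
            qq = list(accumulate(q)) + [n] * (n - len(q))
            return all(a >= b for a, b in zip(pp, qq))

        N = 8                                   # small N for illustration (the paper goes to 50)
        parts = list(partitions(N))
        for p in parts:
            C = sum(1 for q in parts
                    if q != p and not (majorizes(p, q, N) or majorizes(q, p, N)))
            W = factorial(N)
            for occupancy in p:                 # microstate count W = N! / prod(n_i!)
                W //= factorial(occupancy)
            print(p, "C =", C, "S =", round(log(W), 3))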

    Unsupervised multilingual learning

    Ph.D. thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2010, by Benjamin Snyder. For centuries, scholars have explored the deep links among human languages. In this thesis, we present a class of probabilistic models that exploit these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for lost language decipherment. We test this model on the ancient Ugaritic language. Our results show that we can automatically uncover much of the historical relationship between Ugaritic and Biblical Hebrew, a known related language.

    Machine Learning Methods for the Analysis of Metagenomes

    As of October 2020, there are 18.6 × 10^15 DNA base pairs publicly available in the Sequence Read Archive and this number is growing at an exponential rate. As DNA sequencing prices continue to drop, many research groups around the world have incorporated high throughput sequencing in their research, giving us access to sequences from many distinct ecosystems. This has revolutionized the field of metagenomics, which aims to fully characterize all organisms and their interactions in a particular system. Nevertheless, the plethora of available data has made its analysis difficult, as traditional techniques such as genome assembly or sequence alignment are bound to fail due to the high noise of metagenomes, or take an impractically long time due to their size. Through this thesis, we explore those challenges and develop techniques to meet them. Chapter 1 serves as an introduction to the fields of metagenomics and machine learning and the applications where the two meet. Chapter 2 examines the different kinds of noise in sequencing datasets and presents PRINSEQ++, a multi-threaded C++ software tool for quality control of sequencing datasets. Chapter 3 describes the analysis of 63 metagenomic samples from children with “nodding syndrome” using Random Forest to give insights into the etiology of the disease. Chapter 4 explores the use of artificial neural networks to classify phage structural proteins derived from metagenomes.
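
    As an illustration of the Chapter 3 style of analysis, a Random Forest can be trained on a per-sample feature table (e.g. taxon or k-mer abundances) against a phenotype label; the file names, column names and parameters below are placeholders, not the thesis data or pipeline.

        import pandas as pd
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # Hypothetical inputs: a samples-by-taxa abundance table and case/control labels.
        features = pd.read_csv("taxon_abundances.csv", index_col=0)
        labels = pd.read_csv("labels.csv", index_col=0)["status"].loc[features.index]

        clf = RandomForestClassifier(n_estimators=500, random_state=0)
        print(cross_val_score(clf, features, labels, cv=5).mean())

        # Feature importances point at taxa most associated with the phenotype.
        clf.fit(features, labels)
        top = sorted(zip(clf.feature_importances_, features.columns), reverse=True)[:10]
        print(top)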

    The entropy of words: learnability and expressivity across more than 1000 languages

    The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally. Information theory gives us tools at hand to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings: Firstly, word entropies display relatively narrow, unimodal distributions. There is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication. Languages are held in a narrow range by two fundamental pressures: word learnability and word expressivity, with a potential bias towards expressivity. Secondly, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around ca. three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across languages of the world.
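
    The two quantities compared above, the unigram word entropy and a co-text-conditioned entropy, can be estimated with simple plug-in estimators as sketched below; corpus.txt is a placeholder file, a bigram conditional entropy stands in for the entropy rate, and none of the bias corrections needed for serious estimates is applied.

        import math
        from collections import Counter

        def unigram_entropy(words):
            # plug-in unigram word entropy in bits/word
            counts = Counter(words)
            n = len(words)
            return -sum(c / n * math.log2(c / n) for c in counts.values())

        def bigram_conditional_entropy(words):
            # H(w_i | w_{i-1}) = H(bigrams) - H(unigrams), in bits/word
            joint = Counter(zip(words, words[1:]))
            n = sum(joint.values())
            h_joint = -sum(c / n * math.log2(c / n) for c in joint.values())
            return h_joint - unigram_entropy(words[:-1])

        words = open("corpus.txt").read().split()   # placeholder corpus file
        h1 = unigram_entropy(words)
        h2 = bigram_conditional_entropy(words)
        print("unigram: %.2f bits/word, conditional: %.2f, reduction: %.2f"
              % (h1, h2, h1 - h2))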