9 research outputs found

    The challenges of statistical patterns of language: the case of Menzerath's law in genomes

    The importance of statistical patterns of language has been debated for decades. Although Zipf's law is perhaps the most popular case, Menzerath's law has recently entered the debate. Menzerath's law manifests in language, music and genomes as a tendency of the mean size of the parts to decrease as the number of parts increases in many situations. In genomes this regularity appears, for instance, as a tendency of species with more chromosomes to have a smaller mean chromosome size. It has been argued that the instantiation of this law in genomes is not indicative of any parallel between language and genomes because (a) the law is inevitable and (b) non-coding DNA dominates genomes. Here the mathematical, statistical and conceptual challenges of these criticisms are discussed. Two major conclusions are drawn: the law is not inevitable, and languages also have a correlate of non-coding DNA. However, the wide range of manifestations of the law in and outside genomes suggests that the striking similarities between non-coding DNA and certain linguistic units could be anecdotal for understanding the recurrence of that statistical law.
    Comment: Title changed, abstract and introduction improved, and minor corrections to the statistical argument.

    The parameters of Menzerath-Altmann law in genomes

    The relationship between the size of the whole and the size of the parts in language and music is known to follow the Menzerath-Altmann law at many levels of description (morphemes, words, sentences...). Qualitatively, the law states that the larger the whole, the smaller its parts, e.g., the longer a word (in syllables), the shorter its syllables (in letters or phonemes). This patterning has also been found in genomes: the longer a genome (in chromosomes), the shorter its chromosomes (in base pairs). However, it has been argued recently that mean chromosome length is trivially a pure power function of chromosome number with an exponent of -1. The functional dependency between mean chromosome size and chromosome number in groups of organisms from three different kingdoms is studied. The fit of a pure power function yields exponents between -1.6 and 0.1. It is shown that an exponent of -1 is unlikely for fungi, gymnosperm plants, insects, reptiles, ray-finned fishes and amphibians. Even when the exponent is very close to -1, adding an exponential component yields a better fit than a pure power law in plants, mammals, ray-finned fishes and amphibians. The parameters of the Menzerath-Altmann law in genomes deviate significantly from a power law with a -1 exponent, with the exception of birds and cartilaginous fishes.
    Comment: Typos and little inaccuracies corrected. Title and references updated (the previous update failed
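The "trivial" exponent of -1 mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a hypothetical group of species that all share the same total genome size `G`, so that mean chromosome size is exactly `G / n`. The helper `fit_power_law`, the value of `G` and the chromosome counts are illustrative assumptions, not data or code from the paper. A log-log least-squares fit of a pure power function `y = a * x**b` then recovers `b` close to -1, whereas real groups with varying genome size can deviate from it.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + b * log x; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                 # slope in log-log space = exponent
    a = math.exp(my - b * mx)     # intercept in log-log space = log a
    return a, b

# Hypothetical data: constant genome size G (base pairs) split into n chromosomes.
G = 3.0e9
chromosome_counts = [4, 8, 16, 23, 40]
mean_sizes = [G / n for n in chromosome_counts]

a, b = fit_power_law(chromosome_counts, mean_sizes)
print(f"a = {a:.3g}, b = {b:.3f}")
```

With constant `G` the fitted exponent is -1 up to floating-point error. The abstract's point is that empirical groups need not behave this way.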

    Parallels of human language in the behavior of bottlenose dolphins

    A short review of similarities between dolphins and humans with the help of quantitative linguistics and information theory.

    The infochemical core

    Vocalizations, and less often gestures, have been the object of linguistic research for decades. However, the development of a general theory of communication with human language as a particular case requires a clear understanding of the organization of communication through other means. Infochemicals are chemical compounds that carry information and are employed by small organisms that cannot emit acoustic signals of an optimal frequency to achieve successful communication. Here, we investigate the distribution of infochemicals across species when they are ranked by their degree or the number of species with which they are associated (because they produce them or are sensitive to them). We evaluate the quality of the fit of different functions to the dependency between degree and rank by means of a penalty for the number of parameters of the function. Surprisingly, a double Zipf (a Zipf distribution with two regimes, each with a different exponent) is the model yielding the best fit although it is the function with the largest number of parameters. This suggests that the worldwide repertoire of infochemicals contains a core which is shared by many species and is reminiscent of the core vocabularies found for human language in dictionaries or large corpora.
    Peer Reviewed. Postprint (author's final draft)

    Units and constituency in prosodic analysis: a quantitative assessment

    Drawing on methods from quantitative linguistics, this paper tests the hypothesis that the intonation unit is a valid language construct whose immediate constituent is the foot (and whose own immediate constituent is the syllable). If the hypothesis is true, then the lengths of intonation units, measured in feet, should abide by a regular and parsimonious discrete probability distribution, and the immediate constituency relationship between feet and intonation units should be further demonstrable by successfully fitting the Menzerath-Altmann equation with a negative exponent. However, out of sixteen texts from the Aix-MARSEC database, only six share a common probability distribution and only eight exhibit a tolerable fit of the Menzerath-Altmann equation. A failure rate of ≥ 50% in both cases casts doubt on the validity of the hypothesis.

    Quantitative linguistics and automatic text analysis. 1990 = Квантитативная лингвистика и автоматический анализ текстов

    CONTENTS
    • Andreewskaya A.W. Russian XI-XX Centuries Root-words' Quantitative Analysis
    • Blekhman M.S. Some Methods of Automatic Text Attribution: Practical Results
    • Golubeva-Monatkina N.I. Statistical Characteristics of Communication Properties of Questions and Answers of Russian Dialogic Speech
    • Gorot E.I. Isomorphous and Distinguishing Features of Morphemes and Syllables in Their Distribution according to Their Length
    • Zubov A.V. A System of Automatic Scientific Research in Philology
    • Ivanyuk V.Yu., Levitsky V.V. The Selectivity of Sense Collocation and Possible Ways of Its Statistical Expression
    • Manasyan N. Once Again on the Differentiation of English Technological Texts
    • Ostapenko V.E. Principes de solution formelle du probleme de correlation entre un terme et un mot
    • Savchuk S.O. On Some Substantial Characteristics of Style
    • Chebanov S.V., Martynenko G.Ya. Ideas of Hermeneutics in Applied Linguistics
    Survey
    • Polikarpov A., Tuldava J. All-Union Conference on Computational Linguistics in Tartu (May 29-3% 1990)
    Review
    • Tuldava J. Review of: G. Altmann, M.H. Schwibbe. Das Menzerathsche Gesetz in informationsverarbeitenden Systemen / Mit Beitragen von W. Kaumanns, R. Kohler und J. Wilde. - Hildesheim; Zurich; New York: Georg Olms Verlag, 1989 122
    http://tartu.ester.ee/record=b1079382~S1*es

    The Phylogeny and Function of Vocal Complexity in Geladas

    The complexity of vocal communication varies widely across taxa – from humans who can create an infinite repertoire of sound combinations to some non-human species that produce only a few discrete sounds. A growing body of research is aimed at understanding the origins of ‘vocal complexity’. And yet, we still understand little about the evolutionary processes that led to, and the selective advantages of engaging in, complex vocal behaviors. I contribute to this body of research by examining the phylogeny and function of vocal complexity in wild geladas (Theropithecus gelada), a primate known for its capacity to combine a suite of discrete sound types into varied sequences. First, I investigate the phylogeny of vocal complexity by comparing gelada vocal communication with that of their close baboon relatives and with humans. Comparisons of vocal repertoires reveal that geladas – specifically the males – produce a suite of unique or ‘derived’ call types that results in a more diversified vocal repertoire than baboons. Also, comparisons of acoustic properties reveal that geladas produce vocalizations with greater spectro-temporal modulation, a feature shared with human speech, than baboons. Additionally, I show that the same organizational principle – Menzerath’s law – underpins the structure of gelada vocal sequences (i.e., combinations of derived and homologous call types) and human sentences. Second, I investigate the function of vocal complexity by examining the perception of male complex vocal sequences (i.e., those with more derived call types), the contexts in which they are produced, and how their production differs across individuals. A playback experiment shows that female geladas perceive ‘complex’ and ‘simple’ vocal sequences as being different. 
Then, two observational studies show that male production of complex vocal sequences mediates their affiliative interactions with females, both during neutral periods and periods of uncertainty (e.g., following conflicts). Finally, I find evidence that vocal complexity can act as a signal of male ‘quality’, in that more dominant males exhibit higher levels of vocal complexity than their subordinate counterparts. Collectively, the work presented in this dissertation presents an integrative investigation of the ultimate origins of complex communication systems, and in the process, it highlights the critical importance of approaching the study of complexity from several scientific perspectives.
    PHD. Psychology. University of Michigan, Horace H. Rackham School of Graduate Studies.
    https://deepblue.lib.umich.edu/bitstream/2027.42/138479/1/gustison_1.pd

    Analyzing and Improving Statistical Language Models for Speech Recognition

    In many current speech recognizers, a statistical language model is used to indicate how likely it is that a certain word will be spoken next, given the words recognized so far. How can statistical language models be improved so that more complex speech recognition tasks can be tackled? Since knowing the weaknesses of a theory often makes improving it easier, the central idea of this thesis is to analyze the weaknesses of existing statistical language models in order to subsequently improve them. To that end, we formally define a weakness of a statistical language model in terms of the logarithm of the total probability, LTP, a term closely related to the standard perplexity measure used to evaluate statistical language models. We apply our definition of a weakness to a frequently used statistical language model, called a bi-pos model. This results, for example, in a new modeling of unknown words which improves the performance of the model by 14% to 21%. Moreover, one of the identified weaknesses has prompted the development of our generalized N-pos language model, which is also outlined in this thesis. It can incorporate linguistic knowledge even if it extends over many words, which is not feasible in a traditional N-pos model. This leads to a discussion of what knowledge should be added to statistical language models in general, and we give criteria for selecting potentially useful knowledge. These results show the usefulness of both our definition of a weakness and of performing an analysis of weaknesses of statistical language models in general.
    Comment: 140 pages, postscript, approx 500KB.
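The abstract above does not give the thesis's exact definition of LTP, so the following is only a sketch under a stated assumption: LTP is taken here to be the natural logarithm of the product of the per-word probabilities a model assigns to a test text. Under that assumption, the standard perplexity is `exp(-LTP / N)` for a text of `N` words. The function names and the probability values are hypothetical.

```python
import math

def log_total_probability(word_probs):
    """Assumed LTP: natural log of the product of per-word probabilities."""
    return sum(math.log(p) for p in word_probs)

def perplexity(word_probs):
    """Standard perplexity: exp of the negative mean log-probability per word."""
    return math.exp(-log_total_probability(word_probs) / len(word_probs))

# Hypothetical model output: each of four words predicted with probability 0.25.
probs = [0.25, 0.25, 0.25, 0.25]
print(log_total_probability(probs), perplexity(probs))
```

A model that is uniformly uncertain among four choices has a perplexity of 4. A higher (less negative) LTP on the same text corresponds to a lower perplexity.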

    Finding structure in language

    Since the Chomskian revolution, it has become apparent that natural language is richly structured, being naturally represented hierarchically, and requiring complex context sensitive rules to define regularities over these representations. It is widely assumed that the richness of the posited structure has strong nativist implications for mechanisms which might learn natural language, since it seemed unlikely that such structures could be derived directly from the observation of linguistic data (Chomsky 1965). This thesis investigates the hypothesis that simple statistics of a large, noisy, unlabelled corpus of natural language can be exploited to discover some of the structure which exists in natural language automatically. The strategy is to initially assume no knowledge of the structures present in natural language, save that they might be found by analysing statistical regularities which pertain between a word and the words which typically surround it in the corpus. To achieve this, various statistical methods are applied to define similarity between statistical distributions, and to infer a structure for a domain given knowledge of the similarities which pertain within it. Using these tools, it is shown that it is possible to form a hierarchical classification of many domains, including words in natural language. When this is done, it is shown that all the major syntactic categories can be obtained, and the classification is both relatively complete, and very much in accord with a standard linguistic conception of how words are classified in natural language. Once this has been done, the categorisation derived is used as the basis of a similar classification of short sequences of words. If these are analysed in a similar way, then several syntactic categories can be derived. These include simple noun phrases, various tensed forms of verbs, and simple prepositional phrases. 
Once this has been done, the same technique can be applied one level higher, and at this level simple sentences and verb phrases, as well as more complicated noun phrases and prepositional phrases, are shown to be derivable.