
    A Note on Zipf's Law, Natural Languages, and Noncoding DNA regions

    In Phys. Rev. Letters (73:2, 5 Dec. 94), Mantegna et al. conclude on the basis of Zipf rank frequency data that noncoding DNA sequence regions are more like natural languages than coding regions. We argue on the contrary that an empirical fit to Zipf's "law" cannot be used as a criterion for similarity to natural languages. Although DNA is presumably an "organized system of signs" in Mandelbrot's (1961) sense, an observation of statistical features of the sort presented in the Mantegna et al. paper does not shed light on the similarity between DNA's "grammar" and natural language grammars, just as the observation of exact Zipf-like behavior cannot distinguish between the underlying processes of tossing an M-sided die or a finite-state branching process. Comment: compressed uuencoded postscript file: 14 pages

    Statistical Mechanical Approach to Human Language

    We use the formulation of equilibrium statistical mechanics in order to study some important characteristics of language. Using a simple expression for the Hamiltonian of a language system, which is directly implied by the Zipf law, we are able to explain several characteristic features of human language that seem completely unrelated, such as the universality of the Zipf exponent, the vocabulary size of children, the reduced communication abilities of people suffering from schizophrenia, etc. While several explanations are necessarily only qualitative at this stage, we have, nevertheless, been able to derive a formula for the vocabulary size of children as a function of age, which agrees rather well with experimental data. Comment: 20 pages, 4 figures, accepted in Physica

    What is Life?

    In searching for life in extraterrestrial space, it is essential to act based on an unequivocal definition of life. In the twentieth century, life was defined as cells that self-replicate, metabolize, and are open for mutations, without which genetic information would remain unchangeable, and evolution would be impossible. Current definitions of life derive from statistical mechanics, physics, and chemistry of the twentieth century, in which life is considered to function like a machine, ignoring the central role of communication. Recent observations show that context-dependent meaningful communication and network formation (and control) are central to all life forms. Evolutionarily relevant new nucleotide sequences now appear to have originated from social agents such as viruses, their parasitic relatives, and related RNA networks, not from errors. By applying the known features of natural languages and communication, a new twenty-first century definition of life can be reached in which communicative interactions are central to all processes of life. A new definition of life must integrate the current empirical knowledge about interactions between cells, viruses, and RNA networks to provide better explanatory power than the twentieth century narrative.

    Ciliate Gene Unscrambling with Fewer Templates

    One of the theoretical models proposed for the mechanism of gene unscrambling in some species of ciliates is the template-guided recombination (TGR) system by Prescott, Ehrenfeucht and Rozenberg, which has been generalized by Daley and McQuillan from a formal language theory perspective. In this paper, we propose a refinement of this model that generates regular languages using the iterated TGR system with a finite initial language and a finite set of templates, using fewer templates and a smaller alphabet compared to that of the Daley-McQuillan model. To achieve Turing completeness using only finite components, i.e., a finite initial language and a finite set of templates, we also propose an extension of the contextual template-guided recombination system (CTGR system) by Daley and McQuillan, by adding an extra control called permitting contexts on the usage of templates. Comment: In Proceedings DCFS 2010, arXiv:1008.127

    A DNA Codification for Genetic Algorithms Simulation

    In this paper we propose a model of encoding data into DNA strands so that this data can be used in the simulation of a genetic algorithm based on molecular operations. DNA computing is an impressive computational model that needs algorithms to work properly and efficiently. The first problem when trying to apply an algorithm in DNA computing must be how to codify the data that the algorithm will use. In a genetic algorithm the first objective must be to codify the genes, which are the main data. A concrete encoding of the genes in a single DNA strand is presented and we discuss what this codification is suitable for. Previous work on DNA coding defined bond-free languages with several properties assuring the stability of any DNA word of such a language. We prove that a bond-free language is necessary but not sufficient to codify a gene giving the correct codification.

    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which use in a crucial way data compression techniques in order to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques could be used to introduce the notion of dictionary of a given sequence and of Artificial Text and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method that applies to any kind of corpora of character strings independently of the type of coding behind them. We consider as a case study linguistically motivated problems and we present results for automatic language recognition, authorship attribution and self-consistent classification. Comment: Revised version, with major changes, of previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figures
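
    The idea of a compression-based distance between two character sequences, as described in the abstract above, can be illustrated with a minimal sketch. This is not the authors' measure: it uses the widely known normalized compression distance as a stand-in, with Python's zlib as the compressor. The function names `compressed_size` and `ncd` are chosen here for illustration only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Stand-in for a compression-based remoteness measure (an assumption,
    not the measure defined in the paper). Values near 0 suggest the
    sequences share most of their structure; values near 1 suggest
    they share very little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # cost of compressing both together
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog " * 20
    other = bytes(range(256)) * 4
    print("distance to itself:   ", ncd(text, text))
    print("distance to unrelated:", ncd(text, other))
```

    A sequence compared with itself should score much lower than one compared with unrelated data, since the compressor can reuse what it learned from the first half when encoding the second.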