    Learning Chomsky-like grammars for biological sequence families

    This paper presents a new method of measuring performance when positives are rare and investigates whether Chomsky-like grammar representations are useful for learning accurate, comprehensible predictors of members of biological sequence families. The positive-only learning framework of the Inductive Logic Programming (ILP) system CProgol is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the positive-only learning framework of CProgol. Performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The RA results show that searching for NPPs by using our best NPP predictor as a filter is more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. The highest RA was achieved by a model which includes grammar-derived features. This RA is significantly higher than the best RA achieved without the use of the grammar-derived features.

    A sequence-length sensitive approach to learning biological grammars using inductive logic programming.

    This thesis investigates whether the ideas behind compression principles, such as Minimum Description Length, can help improve the process of learning biological grammars from protein sequences using Inductive Logic Programming (ILP). Unlike the examples in most traditional ILP learning problems, biological sequences often vary widely in length. This variation in length is an important feature of biological sequences which should not be ignored by ILP systems. However, we have identified that some ILP systems do not take into account the length of examples when evaluating their proposed hypotheses. During the learning process, many ILP systems use clause evaluation functions to assign a score to induced hypotheses, estimating their quality and effectively influencing the search. Traditionally, clause evaluation functions do not take into account the length of the examples which are covered by the clause. We propose L-modification, a way of modifying existing clause evaluation functions so that they take into account the length of the examples which they learn from. An empirical study was undertaken to investigate whether significant improvements can be achieved by applying L-modification to a standard clause evaluation function. Furthermore, we investigated more generally how ILP systems cope with the length of examples in training data. We show that our L-modified clause evaluation function outperforms our benchmark function in every experiment we conducted, and we take this as evidence that L-modification is a useful concept. We also show that the length of the examples in the training data used by ILP systems has a clear impact on the results.

    The Biological Nature of Human Language

    Biolinguistics aims to shed light on the specifically biological nature of human language, focusing on five foundational questions: (1) What are the properties of the language phenotype? (2) How does language ability grow and mature in individuals? (3) How is language put to use? (4) How is language implemented in the brain? (5) What evolutionary processes led to the emergence of language? These foundational questions are used here to frame a discussion of important issues in the study of language. The discussion explores whether our linguistic capacity is the result of direct selective pressure or of developmental or biophysical constraints, and assesses whether the neural/computational components entering into language are unique to human language or shared with other cognitive systems. This leads to a discussion of advances in theoretical linguistics, psycholinguistics, comparative animal behavior and psychology, and genetics/genomics, disciplines that can now place these longstanding questions in a new light while raising challenges for future research.

    A note on retrodiction and machine evolution

    Biomolecular communication demands that interactions between parts of a molecular system act as scaffolds for message transmission. It also requires an evolving and organized system of signs - a communicative agency - for creating and transmitting meaning. Here I explore the need to dissect biomolecular communication with retrodiction approaches that make claims about the past given information that is available in the present. While the passage of time restricts the explanatory power of retrodiction, the use of molecular structure in biology offsets information erosion. This allows description of the gradual evolutionary rise of structural and functional innovations in RNA and proteins. The resulting chronologies can also describe the gradual rise of molecular machines of increasing complexity and computational capability. For example, the accretion of rRNA substructures and ribosomal proteins can be traced in time and placed within a geological timescale. Phylogenetic, algorithmic and theoretically inspired accretion models can be reconciled into a congruent evolutionary model. Remarkably, the times of origin of enzymes, functional RNA, non-ribosomal peptide synthetase (NRPS) complexes, and ribosomes suggest they gradually climbed Chomsky's hierarchy of formal grammars, supporting the gradual complexification of machines and communication in molecular biology. Future retrodiction approaches and in-depth exploration of theoretical models of computation will need to confirm such evolutionary progression. Comment: 7 pages, 1 figure

    Derivation of Context-free Stochastic L-Grammar Rules for Promoter Sequence Modeling Using Support Vector Machine

    Formal grammars can be used for describing complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and the morphology of a variety of organisms. We believe that parallel grammars can also be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions, which makes promoter recognition a complex problem. We replace the problem of promoter recognition with the induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from the Drosophila and vertebrate promoter datasets using a genetic programming technique, and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L-grammar rules are analyzed and compared with natural promoter sequences.
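
    As a rough illustration of what a context-free stochastic L-grammar does, the sketch below applies one production to every symbol of a DNA string in parallel, choosing among alternative productions by probability. The rule set `RULES`, the axiom `"TATA"` and the function names are hypothetical and chosen only for illustration; they are not the rules derived in the paper, and the genetic programming search and SVM fitness evaluation are not shown.

```python
import random


def rewrite(sequence, rules, rng):
    """One parallel derivation step: every symbol is rewritten at once."""
    out = []
    for symbol in sequence:
        options = rules.get(symbol)
        if not options:
            # No production for this symbol: keep it unchanged.
            out.append(symbol)
            continue
        successors = [successor for successor, _ in options]
        weights = [prob for _, prob in options]
        # Context-free: the choice depends only on the symbol itself.
        out.append(rng.choices(successors, weights=weights, k=1)[0])
    return "".join(out)


def derive(axiom, rules, steps, seed=0):
    """Apply `steps` parallel rewriting steps starting from `axiom`."""
    rng = random.Random(seed)
    sequence = axiom
    for _ in range(steps):
        sequence = rewrite(sequence, rules, rng)
    return sequence


# Hypothetical productions: symbol -> [(successor, probability), ...]
RULES = {
    "A": [("AT", 0.6), ("A", 0.4)],
    "T": [("TA", 0.5), ("T", 0.5)],
    "G": [("GC", 0.3), ("G", 0.7)],
}

if __name__ == "__main__":
    print(derive("TATA", RULES, steps=3))
```

    The difference from an ordinary sequential grammar is that all symbols are rewritten in the same step, so the string can grow everywhere at once rather than at a single nonterminal.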