    Probabilistic grammatical model of protein language and its application to helix-helix contact site classification

    BACKGROUND: Hidden Markov Models power many state-of-the-art tools in the field of protein bioinformatics. While excelling at their tasks, these methods of protein analysis do not directly convey information on medium- and long-range residue-residue interactions; capturing such interactions requires the expressive power of at least context-free grammars. However, the application of more powerful grammar formalisms to protein analysis has been surprisingly limited. RESULTS: In this work, we present a probabilistic grammatical framework for problem-specific protein languages and apply it to the classification of transmembrane helix-helix pair configurations. The core of the model is a probabilistic context-free grammar, automatically inferred by a genetic algorithm from only a generic set of expert-based rules and positive training samples. The model was applied to produce sequence-based descriptors of four classes of transmembrane helix-helix contact site configurations. The best classifier reached an AUC ROC of 0.70. Analysis of the grammar parse trees revealed their ability to represent structural features of helix-helix contact sites. CONCLUSIONS: We demonstrated that our probabilistic context-free framework for the analysis of protein sequences outperforms the state of the art in the task of helix-helix contact site classification, even without necessarily modeling long-range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human-readable, so they can provide biologically meaningful information for molecular biologists.
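
    As a rough illustration of the kind of model involved, a toy probabilistic context-free grammar over two residue classes can be parsed with an off-the-shelf Viterbi parser. This is a minimal sketch in Python using NLTK, not the authors' inferred grammar; the rules and probabilities below are invented for the example:

        # Toy PCFG over residue classes 'h' (hydrophobic) and 'p' (polar).
        # Illustration only: the paper's grammar and probabilities are
        # inferred by a genetic algorithm from expert rules and positive
        # training sequences; these rules are made up for the example.
        import nltk

        grammar = nltk.PCFG.fromstring("""
            S -> H S [0.4] | P S [0.3] | H [0.2] | P [0.1]
            H -> 'h' [1.0]
            P -> 'p' [1.0]
        """)

        parser = nltk.parse.ViterbiParser(grammar)
        sequence = list("hphhp")  # a short fragment coded by residue class

        # The most probable parse tree is a human-readable description of
        # the sequence, which is the interpretability the abstract highlights.
        for tree in parser.parse(sequence):
            tree.pretty_print()
            print("log-probability:", tree.logprob())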

    Calibrating Generative Models: The Probabilistic Chomsky–Schützenberger Hierarchy

    A probabilistic Chomsky–Schützenberger hierarchy of grammars is introduced and studied, with the aim of understanding the expressive power of generative models. We offer characterizations of the distributions definable at each level of the hierarchy, including probabilistic regular, context-free, (linear) indexed, context-sensitive, and unrestricted grammars, each corresponding to a familiar probabilistic machine class. Special attention is given to distributions on (unary notations for) positive integers. Unlike in the classical case, where the "semi-linear" languages all collapse into the regular languages, using analytic tools adapted from the classical setting we show that there is no collapse in the probabilistic hierarchy: more distributions become definable at each level. We also address related issues such as closure under probabilistic conditioning.
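
    As a worked instance of the separation on unary strings (an illustrative example in the spirit of the paper, not quoted from it), compare a probabilistic regular grammar with a probabilistic context-free grammar, taking 0 < p < 1/2 so that both processes terminate almost surely:

        % Probabilistic regular grammar: S -> a S with prob. p,
        % S -> a with prob. 1-p. It defines the geometric distribution
        % on lengths n >= 1,
        \[
          P_{\mathrm{reg}}(n) = p^{\,n-1}(1-p),
        \]
        % whose generating function \sum_n P_{\mathrm{reg}}(n) x^n is rational.
        % Probabilistic CFG: S -> S S with prob. p, S -> a with prob. 1-p.
        % Summing over all binary derivation trees with n leaves (Catalan
        % numbers C_{n-1}) gives
        \[
          P_{\mathrm{cf}}(n) = C_{n-1}\, p^{\,n-1} (1-p)^{n},
          \qquad C_{n-1} = \frac{1}{n}\binom{2n-2}{n-1},
        \]
        % whose generating function is algebraic but irrational, so no
        % probabilistic regular grammar defines this distribution: one
        % instance of the no-collapse phenomenon.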

    On the learning of vague languages for syntactic pattern recognition

    A method for learning vague languages, which represent distorted or ambiguous patterns, is proposed in this paper. The goal of the method is to infer a quasi-context-sensitive string grammar, which serves in our model as the generator of patterns. The method is an important component of our multi-derivational model for the parsing of vague languages used in syntactic pattern recognition.
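
    A minimal sketch of the "vague language" idea in Python, assuming (purely for illustration, not the paper's formalism) that the pattern language is given by a finite sample and that membership degree is graded by string similarity:

        # Graded ("vague") membership for distorted patterns. Illustration
        # only: the paper infers a quasi-context-sensitive grammar as the
        # pattern generator; here a finite sample and difflib similarity
        # stand in for that machinery.
        from difflib import SequenceMatcher

        PATTERNS = ["abba", "abab", "aabb"]  # sample of undistorted patterns

        def membership_degree(s: str) -> float:
            """Degree in [0, 1]: 1.0 for an exact pattern, lower with distortion."""
            return max(SequenceMatcher(None, s, p).ratio() for p in PATTERNS)

        print(membership_degree("abba"))   # 1.0: undistorted pattern
        print(membership_degree("abbba"))  # high: one inserted symbol
        print(membership_degree("zzzz"))   # low: not a distorted pattern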

    Learning Interactions of Local and Non-Local Phonotactic Constraints from Positive Input

    This paper proposes a grammatical inference algorithm that learns input-sensitive tier-based strictly local languages across multiple tiers from positive data only, when the locality of the tier constraints and of the tier-projection function is set to 2 (MITSL; De Santo and Graf, 2019). We conduct simulations showing that the algorithm succeeds in learning MITSL patterns over a set of artificial languages.
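
    The single-tier core of this idea can be sketched in a few lines of Python: project each word onto the tier, record the attested 2-factors, and forbid the rest. This is a simplification for illustration; the paper's MITSL learner additionally infers input-sensitive projections across multiple tiers, and the tier and toy data below are invented:

        # Simplified strictly 2-local learner over one fixed tier (TSL-2),
        # from positive data only.
        from itertools import product

        def learn_tsl2(words, tier):
            """Return the forbidden tier bigrams, learned from positive data."""
            attested = set()
            for w in words:
                projected = ["#"] + [s for s in w if s in tier] + ["#"]
                attested.update(zip(projected, projected[1:]))
            symbols = sorted(tier) + ["#"]
            return {bg for bg in product(symbols, repeat=2) if bg not in attested}

        # Toy sibilant-harmony data: 's' and 'S' never mix within a word.
        data = ["satos", "SatoS", "sotas"]
        grammar = learn_tsl2(data, tier={"s", "S"})
        print(("s", "S") in grammar)  # True: the mixed bigram was never attested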

    Feasible Learnability of Formal Grammars and the Theory of Natural Language Acquisition

    We propose to apply a complexity-theoretic notion of feasible learnability called polynomial learnability to the evaluation of grammatical formalisms for linguistic description. Polynomial learnability was originally defined by Valiant in the context of Boolean concept learning and subsequently generalized by Blumer et al. to infinitary domains. We give a clear, intuitive exposition of this notion of learnability and of which characteristics of a collection of languages do or do not aid feasible learnability under this paradigm. In particular, we present a novel, nontrivial constraint on the degree of locality of grammars that allows a rich class of mildly context-sensitive languages to be feasibly learnable. We discuss possible implications of this observation for the theory of natural language acquisition.
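
    For reference, the generalization by Blumer et al. bounds sample complexity via the Vapnik-Chervonenkis dimension; the statement below is the standard bound from the PAC literature, not a result of this paper:

        % Standard PAC sample-complexity bound (Blumer, Ehrenfeucht,
        % Haussler, and Warmuth): any consistent learner for a concept
        % class of VC dimension d outputs an epsilon-accurate hypothesis
        % with probability at least 1 - delta after
        \[
          m \;=\; O\!\left(\frac{1}{\varepsilon}
                \left(d \log \frac{1}{\varepsilon} + \log \frac{1}{\delta}\right)\right)
        \]
        % examples. Polynomial learnability additionally requires the
        % learner to run in time polynomial in m and in the size of the
        % target grammar.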

    Segmentation of Document Using Discriminative Context-free Grammar Inference and Alignment Similarities

    Text documents present a great challenge to the field of document recognition. Automatic segmentation and layout analysis of documents is used for the interpretation and machine translation of documents. Documents such as research papers, address books, and news articles are typically available in unstructured form, and extracting relevant knowledge from them is recognized as a promising but complex and tedious task. Existing approaches rely on conditional random fields (CRFs) exploiting contextual information, or on hand-coded wrappers, to label text fields (such as name, phone number, and address). In this paper we propose a novel approach that infers grammar rules using alignment similarities and a discriminative context-free grammar, which helps extract the desired information from a document. DOI: 10.17762/ijritcc2321-8169.160410
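
    As a rough sketch of the alignment-similarity step (an illustrative reconstruction in Python, not the paper's algorithm), lines of a semi-structured document can be grouped by pairwise alignment similarity before rules are induced for each group:

        # Greedy grouping of document lines by alignment similarity, so that
        # each group can later be described by an induced rule. Illustration
        # only; the paper couples this with a discriminative context-free
        # grammar, and the threshold here is an arbitrary choice.
        from difflib import SequenceMatcher

        def cluster_lines(lines, threshold=0.6):
            """A line joins the first cluster whose representative aligns
            with it at or above the threshold; otherwise it starts a new one."""
            clusters = []
            for line in lines:
                for cluster in clusters:
                    if SequenceMatcher(None, line, cluster[0]).ratio() >= threshold:
                        cluster.append(line)
                        break
                else:
                    clusters.append([line])
            return clusters

        lines = [
            "John Smith, 555-0134, 12 Oak St",
            "Mary Jones, 555-0188, 9 Elm Ave",
            "Section 2: Related Work",
        ]
        for group in cluster_lines(lines):
            print(group)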