Probabilistic grammatical model of protein language and its application to helix-helix contact site classification
BACKGROUND: Hidden Markov Models power many state-of-the-art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not directly convey information on medium- and long-range residue-residue interactions. This requires an expressive power of at least context-free grammars. However, the application of more powerful grammar formalisms to protein analysis has been surprisingly limited. RESULTS: In this work, we present a probabilistic grammatical framework for problem-specific protein languages and apply it to the classification of transmembrane helix-helix pair configurations. The core of the model consists of a probabilistic context-free grammar, automatically inferred by a genetic algorithm from only a generic set of expert-based rules and positive training samples. The model was applied to produce sequence-based descriptors of four classes of transmembrane helix-helix contact site configurations. The highest performance of the classifiers reached an AUC ROC of 0.70. The analysis of grammar parse trees revealed their ability to represent structural features of helix-helix contact sites. CONCLUSIONS: We demonstrated that our probabilistic context-free framework for the analysis of protein sequences outperforms the state of the art in the task of helix-helix contact site classification. Notably, this is achieved without necessarily requiring the modeling of long-range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human-readable, so they could provide biologically meaningful information for molecular biologists.
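The core mechanism described above, a probabilistic context-free grammar whose derivation probabilities score sequences, can be sketched in a few lines. The toy grammar and residue classes below are illustrative assumptions only, not the grammar inferred in the paper:

```python
# A minimal PCFG sketch: rule probabilities for each left-hand side sum to 1,
# and the probability of a derivation is the product of the rules it uses.
# The grammar and residue classes ("h" hydrophobic, "p" polar) are toy
# assumptions for illustration, not the paper's inferred grammar.

PCFG = {
    "S": [(("H", "S"), 0.7), (("H",), 0.3)],  # S -> H S | H
    "H": [(("h",), 0.6), (("p",), 0.4)],      # H -> h | p
}

def derivation_probability(tree):
    """tree = (lhs, children); children are subtrees or terminal strings."""
    lhs, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = dict(PCFG[lhs])[rhs]
    for c in children:
        if not isinstance(c, str):
            prob *= derivation_probability(c)
    return prob

# P(S -> H S, H -> h, S -> H, H -> p) = 0.7 * 0.6 * 0.3 * 0.4
tree = ("S", [("H", ["h"]), ("S", [("H", ["p"])])])
print(derivation_probability(tree))  # 0.0504
```

A genetic algorithm, as in the paper, would search over the rule set and probabilities; scoring a candidate grammar reduces to summing such derivation probabilities over the training sequences.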
Calibrating Generative Models: The Probabilistic Chomsky–Schützenberger Hierarchy
A probabilistic Chomsky–Schützenberger hierarchy of grammars is introduced and studied, with the aim of understanding the expressive power of generative models. We offer characterizations of the distributions definable at each level of the hierarchy, including probabilistic regular, context-free, (linear) indexed, context-sensitive, and unrestricted grammars, each corresponding to familiar probabilistic machine classes. Special attention is given to distributions on (unary notations for) positive integers. Unlike in the classical case, where the "semi-linear" languages all collapse into the regular languages, we show, using analytic tools adapted from the classical setting, that there is no collapse in the probabilistic hierarchy: more distributions become definable at each level. We also address related issues such as closure under probabilistic conditioning.
On the learning of vague languages for syntactic pattern recognition
A method for learning vague languages, which represent distorted or ambiguous patterns, is proposed in this paper. The goal of the method is to infer a quasi-context-sensitive string grammar, which serves in our model as the generator of patterns. The method is an important component of our multi-derivational model for parsing vague languages in syntactic pattern recognition.
Learning Interactions of Local and Non-Local Phonotactic Constraints from Positive Input
This paper proposes a grammatical inference algorithm to learn input-sensitive tier-based strictly local languages across multiple tiers from positive data only, when the locality of the tier constraints and the tier-projection function is set to 2 (MITSL; De Santo and Graf, 2019). We conduct simulations showing that the algorithm succeeds in learning MITSL patterns over a set of artificial languages.
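The tier-based strictly 2-local core of such learners can be sketched simply: project each word onto a tier, record the attested tier-adjacent bigrams (with word boundaries), and forbid everything unattested. The sketch below assumes a single, fixed tier and a toy sibilant-harmony pattern; the paper's MITSL algorithm additionally learns the input-sensitive tier projections themselves, across multiple tiers:

```python
# Simplified TSL-2 learning from positive data, with the tier given.
# Toy data: sibilants "s" and "S" must harmonize within a word.

def project(word, tier):
    return [s for s in word if s in tier]

def learn_tsl2(positive_data, tier):
    # collect attested bigrams over the tier projection, "#" marks boundaries
    attested = set()
    for word in positive_data:
        t = ["#"] + project(word, tier) + ["#"]
        attested.update(zip(t, t[1:]))
    tier_syms = sorted(tier | {"#"})
    # the grammar is the complement: every unattested tier bigram is forbidden
    return {(a, b) for a in tier_syms for b in tier_syms} - attested

def accepts(word, tier, forbidden):
    t = ["#"] + project(word, tier) + ["#"]
    return not any(bg in forbidden for bg in zip(t, t[1:]))

tier = {"s", "S"}
data = [list(w) for w in ["sasa", "SaSa", "sata", "Sata", "ata"]]
forbidden = learn_tsl2(data, tier)
print(accepts(list("saSa"), tier, forbidden))  # disharmonic -> False
print(accepts(list("sasa"), tier, forbidden))  # harmonic    -> True
```

Because the projection skips intervening non-tier material, the learned constraint is non-local on the surface string ("s...S" is blocked at any distance) while remaining strictly 2-local on the tier.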
Feasible Learnability of Formal Grammars and the Theory of Natural Language Acquisition
We propose to apply a complexity-theoretic notion of feasible learnability called polynomial learnability to the evaluation of grammatical formalisms for linguistic description. Polynomial learnability was originally defined by Valiant in the context of boolean concept learning and subsequently generalized by Blumer et al. to infinitary domains. We give a clear, intuitive exposition of this notion of learnability and of what characteristics of a collection of languages may or may not help feasible learnability under this paradigm. In particular, we present a novel, nontrivial constraint on the degree of locality of grammars which allows a rich class of mildly context-sensitive languages to be feasibly learnable. We discuss possible implications of this observation for the theory of natural language acquisition.
Segmentation of Document Using Discriminative Context-free Grammar Inference and Alignment Similarities
Text documents present a great challenge to the field of document recognition. Automatic segmentation and layout analysis of documents is used for the interpretation and machine translation of documents. Documents such as research papers, address books, and news articles are available in unstructured formats, and extracting relevant knowledge from them has been recognized as a promising task; extracting interesting rules from them is a complex and tedious process. Prior approaches have used conditional random fields (CRFs) exploiting contextual information, or hand-coded wrappers, to label the text (fields such as name, phone number, and address). In this paper we propose a novel approach that infers grammar rules using alignment similarities and a discriminative context-free grammar, which helps in extracting the desired information from the document.
DOI: 10.17762/ijritcc2321-8169.160410