
    Efficient learning of context-free grammars from positive structural examples

    In this paper, we introduce a new normal form for context-free grammars, called reversible context-free grammars, for the problem of learning context-free grammars from positive-only examples. A context-free grammar G = (N, Σ, P, S) is said to be reversible if (1) A → α and B → α in P implies A = B and (2) A → αBβ and A → αCβ in P implies B = C. We show that the class of reversible context-free grammars can be identified in the limit from positive samples of structural descriptions, and that there is an efficient algorithm to do so, where a structural description of a context-free grammar is an unlabelled derivation tree of the grammar. This implies that if positive structural examples of a reversible context-free grammar for the target language are available to the learning algorithm, the full class of context-free languages can be learned efficiently from positive samples.
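    The two conditions are mechanical to check. A minimal sketch (the representation of productions as (lhs, rhs-tuple) pairs and the function name are ours, not the paper's):

```python
from itertools import combinations

def is_reversible(productions, nonterminals):
    """Check the two reversibility conditions on a list of
    (lhs, rhs) pairs, where each rhs is a tuple of symbols."""
    # Condition (1): A -> alpha and B -> alpha implies A = B,
    # i.e. no two distinct nonterminals share an identical right-hand side.
    lhs_for = {}
    for lhs, rhs in productions:
        if rhs in lhs_for and lhs_for[rhs] != lhs:
            return False
        lhs_for[rhs] = lhs
    # Condition (2): A -> alpha B beta and A -> alpha C beta implies B = C,
    # i.e. two rules with the same lhs may not differ in exactly one
    # position where both differing symbols are nonterminals.
    for (l1, r1), (l2, r2) in combinations(productions, 2):
        if l1 != l2 or len(r1) != len(r2):
            continue
        diffs = [i for i in range(len(r1)) if r1[i] != r2[i]]
        if (len(diffs) == 1 and r1[diffs[0]] in nonterminals
                and r2[diffs[0]] in nonterminals):
            return False
    return True

# S and A share the right-hand side (a, A, b), violating condition (1).
G = [("S", ("a", "A", "b")), ("A", ("a", "A", "b")), ("A", ("c",))]
print(is_reversible(G, {"S", "A"}))  # False
```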

    Corpus based learning of stochastic, context-free grammars combined with Hidden Markov Models for tRNA modelling

    In this paper, a new method for modelling tRNA secondary structures is presented. This method is based on the combination of stochastic context-free grammars (SCFG) and Hidden Markov Models (HMM). HMMs are used to capture the local relations in the loops of the molecule (non-structured regions), and SCFGs are used to capture the long-range relations between nucleotides of the arms (structured regions). Given annotated public databases, the HMM and SCFG models are learned by means of automatic inductive learning methods. Two SCFG learning methods have been explored, both of which take advantage of the structural information associated with the training sequences: one is based on a stochastic version of the Sakakibara algorithm and the other on a corpus-based algorithm. A final model is then obtained by merging the HMM of the non-structured regions and the SCFG of the structured regions. Finally, experiments performed on the tRNA and non-tRNA sequence corpora give significant results. Comparative experiments with another published method are also presented.
    We would like to thank Diego Linares and Joan Andreu Sanchez for answering all our questions about SCFGs, as well as Satoshi Sekine for his evaluation software. We would also like to thank the Ministerio de Sanidad y Consumo of Spain for the grants to the INBIOMED consortium.
    García Gómez, J.M.; Benedí Ruiz, J.M.; Vicente Robledo, J.; Robles Viejo, M. (2005). Corpus based learning of stochastic, context-free grammars combined with Hidden Markov Models for tRNA modelling. International Journal of Bioinformatics Research and Applications. 1(3):305-318. doi:10.1504/IJBRA.2005.007908
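    To make the division of labour concrete: in an SCFG, a single rule of the form P → x P y emits both nucleotides of a base pair at once, which is what lets the grammar capture the long-range correlation between the two arms, while the unpaired loop only needs a local, HMM-like model. A toy sketch with invented uniform probabilities, not the models learned in the paper:

```python
import random

# Toy hairpin sampler with invented, uniform probabilities: the stem is
# generated SCFG-style (rule P -> x P y emits both halves of a base pair
# at once), while the loop is generated by a purely local model.
PAIRS = [("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")]

def sample_hairpin(stem_len=5, loop_len=4):
    left, right = [], []
    for _ in range(stem_len):                  # SCFG part: paired arms
        x, y = random.choice(PAIRS)
        left.append(x)
        right.insert(0, y)                     # the matching arm, reversed
    loop = random.choices("acgu", k=loop_len)  # HMM-like part: unpaired loop
    return "".join(left + loop + right)

print(sample_hairpin())  # e.g. 'gcaugucuacaugc'
```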

    Wrapper Maintenance: A Machine Learning Approach

    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.
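    A rough sketch of the verification idea, learning structure from positive examples alone; the token-type signature scheme below is a simplification invented for illustration, not the paper's actual pattern language:

```python
# Reduce each extracted value to a coarse token-type signature, remember
# the signatures seen during training, and flag the wrapper as suspect
# when new extractions stop matching any of them.
def signature(value):
    types = []
    for token in value.split():
        if token.isdigit():
            types.append("NUM")
        elif token[0].isupper():
            types.append("CAPS")
        else:
            types.append("LOWER")
    return tuple(types)

def learn(examples):
    # Learning from positive examples alone: remember every signature seen.
    return {signature(v) for v in examples}

def verify(patterns, extractions, threshold=0.8):
    # Flag the wrapper as broken if too few new extractions match.
    matched = sum(signature(v) in patterns for v in extractions)
    return matched / max(len(extractions), 1) >= threshold

patterns = learn(["4676 Admiralty Way", "220 Lincoln Blvd"])
print(verify(patterns, ["1234 Main St"]))    # True: NUM CAPS CAPS
print(verify(patterns, ["<td></td>", ""]))   # False: format changed
```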

    Dynamic Protocol Reverse Engineering a Grammatical Inference Approach

    Round-trip engineering of software from source code and reverse engineering of software from binary files have both been extensively studied, and the state of practice has documented tools and techniques. Forward engineering of protocols has also been extensively studied, and there are firmly established techniques for generating correct protocols. While observation of protocol behavior for performance testing has been studied and techniques established, reverse engineering of protocol control flow from observations of protocol behavior has not received the same level of attention. The state of practice in reverse engineering the control flow of computer network protocols consists mostly of ad hoc approaches. We examine state-of-practice tools and techniques used in three open source projects: Pidgin, Samba, and rdesktop. We examine techniques proposed by computational learning researchers for grammatical inference. We propose to extend the state of the art by inferring protocol control flow using grammatical-inference-inspired techniques to reverse engineer automata representations from captured data flows. We present evidence that grammatical inference is applicable to the problem domain under consideration.
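    A common first step in the grammatical inference techniques referred to here is to build a prefix tree acceptor from positive samples, which state-merging algorithms such as RPNI then fold into a compact automaton. A minimal sketch, with invented message-type sequences standing in for captured data flows:

```python
# Build a prefix tree acceptor (PTA) from observed sessions; the sessions
# below are invented message-type sequences standing in for captured flows.
def build_pta(sessions):
    transitions = {}   # (state, symbol) -> state
    accepting = set()
    next_state = 1     # state 0 is the root
    for session in sessions:
        state = 0
        for symbol in session:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting

sessions = [["SYN", "SYN-ACK", "ACK"],
            ["SYN", "SYN-ACK", "ACK", "FIN"]]
transitions, accepting = build_pta(sessions)
print(transitions)  # shared prefixes map to shared states
print(accepting)    # {3, 4}
```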

    XRate: a fast prototyping, training and annotation tool for phylo-grammars

    BACKGROUND: Recent years have seen the emergence of genome annotation methods based on the phylo-grammar, a probabilistic model combining continuous-time Markov chains and stochastic grammars. Previously, phylo-grammars have required considerable effort to implement, limiting their adoption by computational biologists. RESULTS: We have developed an open source software tool, xrate, for working with reversible, irreversible or parametric substitution models combined with stochastic context-free grammars. xrate efficiently estimates maximum-likelihood parameters and phylogenetic trees using a novel "phylo-EM" algorithm that we describe. The grammar is specified in an external configuration file, allowing users to design new grammars, estimate rate parameters from training data and annotate multiple sequence alignments without the need to recompile code from source. We have used xrate to measure codon substitution rates and predict protein and RNA secondary structures. CONCLUSION: Our results demonstrate that xrate estimates biologically meaningful rates and makes predictions whose accuracy is comparable to that of more specialized tools.
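    The continuous-time Markov chain half of a phylo-grammar boils down to computing substitution probabilities P(t) = e^{Qt} over a branch of length t from a rate matrix Q whose rows sum to zero. A minimal illustration with an invented Jukes-Cantor-style matrix, not xrate's own machinery or output:

```python
import numpy as np
from scipy.linalg import expm

# Jukes-Cantor-style rate matrix: equal rates, rows summing to zero.
# The numbers are illustrative, not estimates produced by xrate.
rate = 0.1
Q = np.full((4, 4), rate)
np.fill_diagonal(Q, -3 * rate)

for t in (0.1, 1.0, 10.0):
    P = expm(Q * t)   # substitution probabilities P(t) = e^{Qt}
    print(t, P[0])    # row a: probabilities of a -> {a, c, g, u} after t
```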

    Toric grammars: a new statistical approach to natural language modeling

    We propose a new statistical model for computational linguistics. Rather than trying to estimate directly the probability distribution of a random sentence of the language, we define a Markov chain on finite sets of sentences with many finite recurrent communicating classes, and define our language model as the invariant probability measures of the chain on each recurrent communicating class. This Markov chain, which we call a communication model, randomly recombines at each step the set of sentences forming its current state, using grammar rules. When the grammar rules are fixed and known in advance, instead of being estimated on the fly, we can prove supplementary mathematical properties; in particular, we can prove in this case that all states are recurrent, so that the chain defines a partition of its state space into finite recurrent communicating classes. We show that our approach is a decisive departure from Markov models at the sentence level and discuss its relationship with context-free grammars. Although the toric grammars we use are closely related to context-free grammars, the way we generate the language from the grammar is qualitatively different. Our communication model has two purposes. On the one hand, it is used to define indirectly the probability distribution of a random sentence of the language. On the other hand, it can serve as a (crude) model of language transmission from one speaker to another through the communication of a (large) set of sentences.
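    As a caricature of one step of such a chain (this toy swaps sentence tails around a shared word; the actual toric-grammar recombination is more structured):

```python
import random

# Toy recombination step: exchange sentence tails around a shared word.
# This is a caricature; the paper's toric-grammar operation is more
# structured, but the "recombine the current set of sentences" flavour
# is the same.
def recombine(state, rng=random.Random(0)):
    s1, s2 = rng.sample(sorted(state), 2)
    w1, w2 = s1.split(), s2.split()
    shared = sorted(set(w1) & set(w2))
    if not shared:
        return state
    w = rng.choice(shared)
    i, j = w1.index(w), w2.index(w)
    n1 = " ".join(w1[:i + 1] + w2[j + 1:])
    n2 = " ".join(w2[:j + 1] + w1[i + 1:])
    return (state - {s1, s2}) | {n1, n2}

state = {"the dog chased the cat", "a bird watched the dog"}
print(recombine(state))
```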

    Polynomial Learnability and Locality of Formal Grammars

    We apply a complexity-theoretic notion of feasible learnability called polynomial learnability to the evaluation of grammatical formalisms for linguistic description. We show that a novel, nontrivial constraint on the degree of locality of grammars allows not only context-free languages but also a rich class of mildly context-sensitive languages to be polynomially learnable. We discuss possible implications of this result for the theory of natural language acquisition.

    Grammatical inference of directed acyclic graph languages with polynomial time complexity

    In this paper we study the learning of graph languages. We extend the well-known classes of k-testable and k-testable in the strict sense languages to directed graph languages. We propose a grammatical inference algorithm to learn the class of directed acyclic k-testable in the strict sense graph languages. The algorithm runs in polynomial time and identifies this class of languages from positive data. We study its efficiency under several criteria, and perform comprehensive experiments with four datasets to show the validity of the method. Many fields, from pattern recognition to data compression, can take advantage of these results.
    Gallego, A.; López Rodríguez, D.; Calera-Rubio, J. (2018). Grammatical inference of directed acyclic graph languages with polynomial time complexity. Journal of Computer and System Sciences. 95:19-34. https://doi.org/10.1016/j.jcss.2017.12.002
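    For reference, the classical string version of k-testable-in-the-strict-sense inference, which the paper lifts to directed acyclic graphs, learns the allowed prefixes, suffixes and length-k blocks from positive data alone. A minimal sketch for k = 2:

```python
# Classical k-testable-in-the-strict-sense inference for strings (k = 2),
# learning from positive data alone; the paper extends this scheme to
# directed acyclic graphs.
def learn_ktss(sample, k=2):
    prefixes, suffixes, blocks, short = set(), set(), set(), set()
    for w in sample:
        if len(w) < k:
            short.add(w)            # strings too short to have a k-block
            continue
        prefixes.add(w[:k - 1])
        suffixes.add(w[-(k - 1):])
        blocks.update(w[i:i + k] for i in range(len(w) - k + 1))
    return prefixes, suffixes, blocks, short

def accepts(model, w, k=2):
    prefixes, suffixes, blocks, short = model
    if len(w) < k:
        return w in short
    return (w[:k - 1] in prefixes and w[-(k - 1):] in suffixes
            and all(w[i:i + k] in blocks for i in range(len(w) - k + 1)))

model = learn_ktss(["ab", "aab", "aaab"])
print(accepts(model, "aaaab"))  # True: the sample generalizes to a+b
print(accepts(model, "ba"))     # False
```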