
    DNA ANALYSIS USING GRAMMATICAL INFERENCE

    An accurate language definition capable of distinguishing between coding and non-coding DNA has important applications and analytical significance in computational biology. The method proposed here uses positive-sample grammatical inference and statistical information to infer languages for coding DNA. An algorithm is proposed for searching for an optimal subset of input sequences for the inference of regular grammars by optimizing a relevant accuracy metric. The algorithm does not guarantee finding the optimal subset; however, testing shows improvements in accuracy and performance over the basis algorithm. Testing also shows that languages inferred for components of DNA are consistently accurate. Using the proposed algorithm, languages are inferred for coding DNA with an average conditional probability of over 80%. This reveals that languages for components of DNA can be inferred and are useful independently of the process that created them. These languages can then be analyzed or used for other tasks in computational biology. To illustrate potential applications of regular grammars for DNA components, an inferred language for exon sequences is applied as post-processing to hidden Markov model exon prediction, reducing the number of wrong exons detected and significantly improving the specificity of the model.
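    The post-processing idea above can be sketched as follows: once a regular language for exons has been inferred, candidate exon predictions are simply filtered through its automaton. This is a minimal illustration only; the DFA below is a toy hand-written one (accepting sequences whose length is a multiple of 3, a crude proxy for in-frame coding sequence), not the grammar inferred in the paper, and the candidate sequences are invented.

```python
# Toy DFA over the DNA alphabet, used to post-filter candidate exon
# predictions. Illustrative stand-in for an inferred regular language.
ALPHABET = set("ACGT")
ACCEPTING = {0}

def accepts(seq):
    """Run the toy DFA; reject on any non-DNA symbol."""
    state = 0
    for ch in seq:
        if ch not in ALPHABET:
            return False
        state = (state + 1) % 3  # track length mod 3
    return state in ACCEPTING

def filter_exons(candidates):
    """Keep only candidate exons accepted by the inferred language."""
    return [s for s in candidates if accepts(s)]

candidates = ["ATGGCC", "ATGG", "ATGAAAX", "GCGTTTAAA"]
print(filter_exons(candidates))  # ['ATGGCC', 'GCGTTTAAA']
```

    In the paper's setting the automaton would come from grammatical inference rather than being hand-written, but the filtering step applied after the HMM prediction has exactly this shape.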

    Children as Models for Computers: Natural Language Acquisition for Machine Learning

    This paper focuses on a subfield of machine learning, so-called grammatical inference. Roughly speaking, grammatical inference deals with the problem of inferring a grammar that generates a given set of sample sentences, a task realized by some inference algorithm. We discuss how analyzing and formalizing the main features of the process of human natural language acquisition may improve results in the area of grammatical inference.

    Inducing Probabilistic Grammars by Bayesian Model Merging

    We describe a framework for inducing probabilistic grammars from corpora of positive samples. First, samples are incorporated by adding ad-hoc rules to a working grammar; subsequently, elements of the model (such as states or nonterminals) are merged to achieve generalization and a more compact representation. The choice of what to merge and when to stop is governed by the Bayesian posterior probability of the grammar given the data, which formalizes a trade-off between a close fit to the data and a default preference for simpler models ('Occam's Razor'). The general scheme is illustrated using three types of probabilistic grammars: hidden Markov models, class-based n-grams, and stochastic context-free grammars.
    Comment: To appear in Grammatical Inference and Applications, Second International Colloquium on Grammatical Inference; Springer Verlag, 1994. 13 pages
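    The incorporate-then-merge loop with a Bayesian trade-off can be sketched on a toy model. Below, each sample is "incorporated" as its own state holding emission counts, and states are greedily merged while the score, a crude posterior proxy (log-likelihood minus a per-state complexity charge standing in for the prior) improves. The samples and the penalty weight beta are invented for illustration; this is not the paper's actual HMM/SCFG machinery.

```python
import math
from collections import Counter

def log_lik(states):
    """Sum of multinomial log-likelihoods of each state's emission counts."""
    total = 0.0
    for counts in states:
        n = sum(counts.values())
        for c in counts.values():
            total += c * math.log(c / n)
    return total

def score(states, beta=0.2):
    """Posterior proxy: data fit plus a simplicity prior that charges
    beta per state (the Occam's-razor side of the trade-off)."""
    return log_lik(states) - beta * len(states)

def merge_step(states, beta=0.2):
    """Try every pairwise merge; return the best one if it improves the
    score, else None (the stopping criterion)."""
    best, best_s = None, score(states, beta)
    for i in range(len(states)):
        for j in range(i + 1, len(states)):
            merged = states[:i] + states[i + 1:j] + states[j + 1:] \
                     + [states[i] + states[j]]  # pooling counts = merging
            s = score(merged, beta)
            if s > best_s:
                best, best_s = merged, s
    return best

# Incorporation: one state per sample, emission counts from its symbols.
samples = ["aab", "aab", "abb"]
states = [Counter(s) for s in samples]
while (nxt := merge_step(states)) is not None:
    states = nxt
print(len(states))  # 2: the two identical samples merge, the third stays
```

    The run shows the trade-off at work: merging the two identical samples costs no likelihood and saves a state, so it happens; merging the remaining distinct states would lose more fit than the prior rewards, so the search stops.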

    Learning SECp Languages from Only Positive Data

    The field of Grammatical Inference provides a good theoretical framework for investigating a learning process. Formal results in this field can be relevant to the question of first language acquisition. However, Grammatical Inference studies have been focused mainly on mathematical aspects, and have not exploited the linguistic relevance of their results. With this paper, we try to enrich Grammatical Inference studies with ideas from Linguistics. We propose a non-classical mechanism that has relevant linguistic and computational properties, and we study its learnability from positive data.

    Clustering of word types and unification of word tokens into grammatical word-classes

    This paper discusses Neopsy: unsupervised inference of grammatical word-classes in Natural Language. Grammatical inference can be divided into inference of grammatical word-classes and inference of structure. We review the background of supervised learning of Part-of-Speech tagging, and discuss the adaptation of the three main types of Part-of-Speech tagger to unsupervised inference of grammatical word-classes. Statistical N-gram taggers suggest a statistical clustering approach, but clustering does not help with low-frequency word-types, or with the many word-types which can appear in more than one grammatical category. The alternative Transformation-Based Learning tagger suggests a constraint-based approach of unification of word-token contexts. This offers a way to group together low-frequency word-types, and allows different tokens of one word-type to belong to different categories according to the grammatical contexts they appear in. However, simple unification of word-token contexts yields an implausibly large number of Part-of-Speech categories; we have attempted to merge more categories by "relaxing" the matching context to allow unification of word-categories as well as word-tokens, but this results in spurious unifications. We conclude that the way ahead may be a hybrid involving clustering of frequent word-types, unification of word-token contexts, and "seeding" with limited linguistic knowledge. We call for a programme of further research to develop a Language Discovery Toolkit.
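    The statistical-clustering route mentioned above can be sketched in a few lines: represent each word type by counts of its left and right neighbours, then greedily group types whose context vectors are similar. The corpus, similarity threshold, and grouping rule below are all illustrative assumptions, not the Neopsy system itself; note how the sketch also exhibits the paper's complaint, as low-frequency types ("mat", "rug") fail to cluster.

```python
import math
from collections import defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()

def context_vectors(tokens):
    """Map each word type to counts of its left/right neighbours."""
    vecs = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        if i > 0:
            vecs[w][("L", tokens[i - 1])] += 1
        if i < len(tokens) - 1:
            vecs[w][("R", tokens[i + 1])] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def cluster(vecs, threshold=0.9):
    """Greedy single-link grouping of word types by context similarity."""
    clusters = []
    for w in vecs:
        for c in clusters:
            if any(cosine(vecs[w], vecs[m]) >= threshold for m in c):
                c.add(w)
                break
        else:
            clusters.append({w})
    return clusters

vecs = context_vectors(corpus)
clusters = cluster(vecs)
# "cat" and "dog" share the context the_SAT exactly, so they cluster;
# the one-occurrence types stay singletons for lack of evidence.
```
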

    Why languages differ : variation in the conventionalization of constraints on inference

    Sperber and Wilson (1996) and Wilson and Sperber (1993) have argued that communication involves two processes, ostension and inference, but they also assume there is a coding-decoding stage of communication and a functional distinction between lexical items and grammatical marking (what they call 'conceptual' vs. 'procedural' information). Sperber and Wilson have accepted a basically Chomskyan view of the innateness of language structure and Universal Grammar.

    A Grammatical Inference Approach to Language-Based Anomaly Detection in XML

    False positives are a problem in anomaly-based intrusion detection systems. To counter this issue, we discuss anomaly detection for the eXtensible Markup Language (XML) from a language-theoretic view. We argue that many XML-based attacks target the syntactic level, i.e. the tree structure or element content, and that syntax validation of XML documents reduces the attack surface. XML offers so-called schemas for validation, but in the real world, schemas are often unavailable, ignored or too general. In this work-in-progress paper we describe a grammatical inference approach to learn an automaton from example XML documents for detecting documents with anomalous syntax. We discuss properties and expressiveness of XML to understand the limits of learnability. Our contributions are an XML Schema compatible lexical datatype system to abstract content in XML and an algorithm to learn visibly pushdown automata (VPA) directly from a set of examples. The proposed algorithm does not require the tree representation of XML, so it can process large documents or streams. The resulting deterministic VPA then allows stream validation of documents to recognize deviations in the underlying tree structure or datatypes.
    Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and Countermeasures ECTCM 201
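    The stream-validation idea rests on the defining property of visibly pushdown automata: the input alphabet is partitioned into call symbols (open tags, which push), return symbols (close tags, which pop and must match), and internal symbols (text, which leave the stack alone). A minimal sketch, assuming a toy one-level parent/child schema rather than the learned VPA and datatype system from the paper:

```python
# Toy "schema": allowed children per element (assumption for illustration).
SCHEMA = {"root": {"item"}, "item": set()}

def validate(events, schema=SCHEMA):
    """Stream-validate a sequence of ('open', tag), ('close', tag),
    ('text', data) events with a single pass and a stack."""
    stack = []
    for kind, value in events:
        if kind == "open":                       # call symbol: push
            parent_ok = (not stack and value == "root") or \
                        (stack and value in schema.get(stack[-1], set()))
            if not parent_ok:
                return False
            stack.append(value)
        elif kind == "close":                    # return symbol: pop + match
            if not stack or stack[-1] != value:
                return False
            stack.pop()
        # internal symbols ('text') leave the stack untouched
    return not stack                             # everything must be closed

ok = validate([("open", "root"), ("open", "item"), ("text", "hi"),
               ("close", "item"), ("close", "root")])
bad = validate([("open", "root"), ("open", "item"),
                ("close", "root"), ("close", "item")])  # crossed tags
```

    Because the stack operation is dictated by the symbol class, the validator needs no tree in memory, which is exactly what makes the approach workable for large documents and streams.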

    Search diversification techniques for grammatical inference

    Grammatical Inference (GI) addresses the problem of learning a grammar G from a finite set of strings generated by G. By using GI techniques we want to be able to learn relations between syntactically structured sequences. This process of inferring the target grammar G can easily be posed as a search problem through a lattice of possible solutions. The vast majority of research being carried out in this area focuses on non-monotonic searches, i.e., they use the same heuristic function to perform a depth-first search through the lattice until a hypothesis is chosen. EDSM and S-EDSM are prime examples of this technique. In this paper we discuss the introduction of diversification into our search space [5]. By introducing diversification through pairwise incompatible merges, we traverse multiple disjoint paths in the search lattice and obtain better results for the inference process.
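    The diversification step can be sketched as follows: instead of committing only to the single highest-scoring merge, select a set of pairwise incompatible merges and explore each as the root of a separate, disjoint search path. The incompatibility test below (merges sharing no state) and the scored candidates are illustrative assumptions, not the paper's exact definition.

```python
def pairwise_incompatible(scored_merges):
    """Greedily select high-scoring merges that share no state, so each
    selected merge seeds a disjoint branch of the search lattice."""
    chosen, used = [], set()
    for score, merge in sorted(scored_merges, reverse=True):
        if not (set(merge) & used):
            chosen.append(merge)
            used |= set(merge)
    return chosen

# Candidate merges as (EDSM-style score, (state, state)) pairs (invented):
candidates = [(7, ("q1", "q2")), (6, ("q1", "q3")), (5, ("q4", "q5")),
              (3, ("q2", "q4"))]
roots = pairwise_incompatible(candidates)
print(roots)  # [('q1', 'q2'), ('q4', 'q5')]
```

    A plain EDSM search would follow only the score-7 merge; diversification also commits a second search to the score-5 merge, which is compatible with it, so the two paths explore disjoint regions of the lattice.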