48,086 research outputs found
DNA ANALYSIS USING GRAMMATICAL INFERENCE
An accurate language definition capable of distinguishing between coding and non-coding DNA has important applications and analytical significance to the field of computational biology. The method proposed here uses positive sample grammatical inference and statistical information to infer languages for coding DNA.
An algorithm is proposed for the searching of an optimal subset of input sequences for the inference of regular grammars by optimizing a relevant accuracy metric. The algorithm does not guarantee the finding of the optimal subset; however, testing shows improvement in accuracy and performance over the basis algorithm.
Testing shows that the inferred languages for components of DNA are consistently accurate. Using the proposed algorithm, languages are inferred for coding DNA with an average conditional probability of over 80%. This shows that languages for components of DNA can be inferred and are useful independent of the process that created them. These languages can then be analyzed or used for other tasks in computational biology.
To illustrate potential applications of regular grammars for DNA components, an inferred language for exon sequences is applied as a post-processing step to hidden Markov model exon prediction, reducing the number of wrongly detected exons and significantly improving the specificity of the model.
Children as Models for Computers: Natural Language Acquisition for Machine Learning
This paper focuses on a subfield of machine learning, so-called grammatical inference. Roughly speaking, grammatical inference deals with the problem of inferring a grammar that generates a given set of sample sentences, where the inference is carried out by some inference algorithm. We discuss how the analysis and formalization of the main features of the process of human natural language acquisition may improve results in the area of grammatical inference.
Inducing Probabilistic Grammars by Bayesian Model Merging
We describe a framework for inducing probabilistic grammars from corpora of
positive samples. First, samples are incorporated by adding ad-hoc rules
to a working grammar; subsequently, elements of the model (such as states or
nonterminals) are merged to achieve generalization and a more compact
representation. The choice of what to merge and when to stop is governed by the
Bayesian posterior probability of the grammar given the data, which formalizes
a trade-off between a close fit to the data and a default preference for
simpler models ("Occam's Razor"). The general scheme is illustrated using three
types of probabilistic grammars: hidden Markov models, class-based n-grams,
and stochastic context-free grammars.
Comment: To appear in Grammatical Inference and Applications, Second
International Colloquium on Grammatical Inference; Springer Verlag, 1994. 13
page
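The trade-off this abstract describes can be written out as a short worked equation. This is only the standard form of Bayes' rule applied to model selection; the symbols $M$ (a candidate grammar) and $X$ (the sample corpus) are notation assumed here, not taken from the abstract.

```latex
% Posterior of a grammar M given data X (Bayes' rule); P(X) does not depend on M.
\[
  P(M \mid X) \;=\; \frac{P(M)\,P(X \mid M)}{P(X)} \;\propto\; P(M)\,P(X \mid M)
\]
% Taking logarithms separates the two competing terms:
\[
  \log P(M \mid X) \;=\; \underbrace{\log P(M)}_{\text{preference for simpler models}}
  \;+\; \underbrace{\log P(X \mid M)}_{\text{fit to the data}} \;-\; \log P(X)
\]
```

Under this reading, a merge is worth making when it raises the prior term by more than it lowers the likelihood term, and merging stops when no candidate merge increases the posterior.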
Learning SECp Languages from Only Positive Data
The field of Grammatical Inference provides a good theoretical framework for investigating a learning process. Formal results in this field can be relevant to the question of first language acquisition. However, Grammatical Inference studies have been focused mainly on mathematical aspects, and have not exploited the linguistic relevance of their results. With this paper, we try to enrich Grammatical Inference studies with ideas from Linguistics. We propose a non-classical mechanism that has relevant linguistic and computational properties, and we study its learnability from positive data.
Clustering of word types and unification of word tokens into grammatical word-classes
This paper discusses Neopsy: unsupervised inference of grammatical word-classes in Natural Language. Grammatical inference can be divided into inference of grammatical word-classes and inference of structure. We review the background of supervised learning of Part-of-Speech tagging; and discuss the adaptation of the three main types of Part-of-Speech tagger to unsupervised inference of grammatical word-classes. Statistical N-gram taggers suggest a statistical clustering approach, but clustering does not help with low-frequency word-types, or with the many word-types which can appear in more than one grammatical category. The alternative Transformation-Based Learning tagger suggests a constraint-based approach of unification of word-token contexts. This offers a way to group together low-frequency word-types, and allows different tokens of one word-type to belong to different categories according to grammatical contexts they appear in. However, simple unification of word-token-contexts yields an implausibly large number of Part-of-Speech categories; we have attempted to merge more categories by "relaxing" matching context to allow unification of word-categories as well as word-tokens, but this results in spurious unifications. We conclude that the way ahead may be a hybrid involving clustering of frequent word-types, unification of word-token-contexts, and "seeding" with limited linguistic knowledge. We call for a programme of further research to develop a Language Discovery Toolkit
Why languages differ : variation in the conventionalization of constraints on inference
Sperber and Wilson (1996) and Wilson and Sperber (1993) have argued that communication involves two processes, ostension and inference, but they also assume there is a coding-decoding stage of communication and a functional distinction between lexical items and grammatical marking (what they call 'conceptual' vs. 'procedural' information). Sperber and Wilson have accepted a basically Chomskyan view of the innateness of language structure and Universal Grammar
A Grammatical Inference Approach to Language-Based Anomaly Detection in XML
False-positives are a problem in anomaly-based intrusion detection systems.
To counter this issue, we discuss anomaly detection for the eXtensible Markup
Language (XML) in a language-theoretic view. We argue that many XML-based
attacks target the syntactic level, i.e. the tree structure or element content,
and syntax validation of XML documents reduces the attack surface. XML offers
so-called schemas for validation, but in the real world, schemas are often
unavailable, ignored or too general. In this work-in-progress paper we describe
a grammatical inference approach to learn an automaton from example XML
documents for detecting documents with anomalous syntax.
We discuss properties and expressiveness of XML to understand limits of
learnability. Our contributions are an XML Schema compatible lexical datatype
system to abstract content in XML and an algorithm to learn visibly pushdown
automata (VPA) directly from a set of examples. The proposed algorithm does not
require the tree representation of XML, so it can process large documents or
streams. The resulting deterministic VPA then allows stream validation of
documents to recognize deviations in the underlying tree structure or
datatypes.
Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and
Countermeasures ECTCM 201
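To make the idea of stream validation against learned structure concrete, here is a minimal sketch in Python. It is not the paper's VPA learning algorithm: as a much simpler stand-in, it records only which (parent, child) element pairs occur in the example documents and then checks new documents with a stack while parsing incrementally. The function names `learn_pairs` and `is_anomalous` are hypothetical, and element content (datatypes) is ignored.

```python
import io
import xml.etree.ElementTree as ET


def learn_pairs(documents):
    """Collect the (parent_tag, child_tag) pairs seen in example documents.

    The root element is recorded with parent None. This is a simplification
    assumed for illustration, not the automaton learned in the paper.
    """
    allowed = set()
    for doc in documents:
        stack = []  # tags of the currently open elements
        for event, elem in ET.iterparse(io.BytesIO(doc), events=("start", "end")):
            if event == "start":
                parent = stack[-1] if stack else None
                allowed.add((parent, elem.tag))
                stack.append(elem.tag)
            else:
                stack.pop()
    return allowed


def is_anomalous(doc, allowed):
    """Return True as soon as an unseen (parent, child) nesting is opened."""
    stack = []
    for event, elem in ET.iterparse(io.BytesIO(doc), events=("start", "end")):
        if event == "start":
            parent = stack[-1] if stack else None
            if (parent, elem.tag) not in allowed:
                return True
            stack.append(elem.tag)
        else:
            stack.pop()
    return False


examples = [b"<order><item/><note/></order>"]
allowed = learn_pairs(examples)
print(is_anomalous(b"<order><note/></order>", allowed))
print(is_anomalous(b"<order><script/></order>", allowed))
```

The stack plays the role that the pushdown store plays in a visibly pushdown automaton: opening tags push, closing tags pop, and the decision at each step depends only on the current input and the top of the stack. Malformed XML raises `ET.ParseError` here rather than being reported as an anomaly.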
Search diversification techniques for grammatical inference
Grammatical Inference (GI) addresses the problem of learning a grammar G from a finite set of strings generated by G. By using GI techniques we want to be able to learn relations between syntactically structured sequences. This process of inferring the target grammar G can easily be posed as a search problem through a lattice of possible solutions. The vast majority of research being carried out in this area focuses on non-monotonic searches, i.e. searches that use the same heuristic function to perform a depth-first search into the lattice until a hypothesis is chosen. EDSM and S-EDSM are prime examples of this technique. In this paper we discuss the introduction of diversification into our search space [5]. By introducing diversification through pairwise incompatible merges, we traverse multiple disjoint paths in the search lattice and obtain better results for the inference process.
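State-merging methods such as EDSM conventionally start their search from a prefix tree acceptor (PTA) built from the sample strings, and each merge of two states moves one step through the lattice mentioned above. The following Python sketch shows only that starting point, under the assumption that samples are plain strings over single-character symbols; the names `build_pta` and `accepts` are hypothetical, and the merge heuristics and diversification strategy of the paper are not reproduced.

```python
def build_pta(samples):
    """Build a prefix tree acceptor with one state per distinct prefix.

    Returns (transitions, accepting), where transitions maps
    state -> {symbol: next_state} and state 0 is the initial state.
    """
    transitions = {0: {}}
    accepting = set()
    for word in samples:
        state = 0
        for symbol in word:
            if symbol not in transitions[state]:
                new_state = len(transitions)  # next unused state number
                transitions[state][symbol] = new_state
                transitions[new_state] = {}
            state = transitions[state][symbol]
        accepting.add(state)  # the state reached by a full sample accepts
    return transitions, accepting


def accepts(transitions, accepting, word):
    """Run the deterministic automaton on word from state 0."""
    state = 0
    for symbol in word:
        state = transitions[state].get(symbol)
        if state is None:  # no transition defined: reject
            return False
    return state in accepting


transitions, accepting = build_pta(["ab", "abb", "b"])
print(len(transitions))                        # number of states in the PTA
print(accepts(transitions, accepting, "abb"))  # a training sample
print(accepts(transitions, accepting, "ba"))   # not in the sample set
```

The PTA accepts exactly the training samples and nothing else, so it is the most specific hypothesis in the lattice; generalization comes entirely from the merges that a search strategy chooses to apply afterwards.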