DNA ANALYSIS USING GRAMMATICAL INFERENCE
An accurate language definition capable of distinguishing between coding and non-coding DNA has important applications and analytical significance to the field of computational biology. The method proposed here uses positive sample grammatical inference and statistical information to infer languages for coding DNA.
An algorithm is proposed for finding an optimal subset of input sequences for the inference of regular grammars by optimizing a relevant accuracy metric. The algorithm is not guaranteed to find the optimal subset; however, testing shows improvement in accuracy and performance over the base algorithm.
Testing shows that the inferred languages for components of DNA are consistently accurate. Using the proposed algorithm, languages are inferred for coding DNA with an average conditional probability over 80%. This reveals that languages for components of DNA can be inferred and are useful independent of the process that created them. These languages can then be analyzed or used for other tasks in computational biology.
To illustrate potential applications of regular grammars for DNA components, an inferred language for exon sequences is applied as a post-processing step to Hidden Markov exon prediction, reducing the number of wrong exons detected and significantly improving the specificity of the model.
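The subset search and the post-processing filter described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it swaps in k-testable languages (a subclass of the regular languages that can be inferred from positive samples alone) for the unspecified inference method, uses plain validation accuracy instead of the conditional-probability metric, and uses greedy forward selection as the search. All names (`infer`, `accepts`, `greedy_subset`) and the toy sequences are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

K = 3  # window length of the k-testable language (an assumption of this sketch)

# A k-testable language: allowed prefixes, allowed k-grams, allowed suffixes.
Language = Tuple[Set[str], Set[str], Set[str]]


def infer(samples: Iterable[str], k: int = K) -> Language:
    """Infer a k-testable language from positive samples only."""
    prefixes, grams, suffixes = set(), set(), set()
    for s in samples:
        if len(s) < k:
            continue  # too short to contribute a full window
        prefixes.add(s[:k - 1])
        suffixes.add(s[-(k - 1):])
        grams.update(s[i:i + k] for i in range(len(s) - k + 1))
    return prefixes, grams, suffixes


def accepts(lang: Language, s: str, k: int = K) -> bool:
    """True if every window, the prefix, and the suffix of s were observed."""
    prefixes, grams, suffixes = lang
    if len(s) < k:
        return False
    return (s[:k - 1] in prefixes
            and s[-(k - 1):] in suffixes
            and all(s[i:i + k] in grams for i in range(len(s) - k + 1)))


def accuracy(lang: Language, positives: List[str], negatives: List[str]) -> float:
    """Fraction of validation strings classified correctly."""
    correct = sum(accepts(lang, s) for s in positives)
    correct += sum(not accepts(lang, s) for s in negatives)
    return correct / (len(positives) + len(negatives))


def greedy_subset(train: List[str], val_pos: List[str], val_neg: List[str]) -> List[str]:
    """Greedy forward selection; not guaranteed to find the optimal subset."""
    chosen: List[str] = []
    best = accuracy(infer(chosen), val_pos, val_neg)
    improved = True
    while improved:
        improved = False
        for s in train:
            if s in chosen:
                continue
            score = accuracy(infer(chosen + [s]), val_pos, val_neg)
            if score > best:  # remember the best candidate of this round
                best, pick, improved = score, s, True
        if improved:
            chosen.append(pick)
    return chosen


# Toy data (hypothetical): one training string hurts validation accuracy.
train = ["ATGGCA", "ATGCCA", "TTTTTT"]
val_pos = ["ATGGCA", "ATGCCA"]
val_neg = ["TTTTTT", "TTTT"]

chosen = greedy_subset(train, val_pos, val_neg)
lang = infer(chosen)

# Post-processing: keep only predicted exons that the inferred language accepts.
predicted = ["ATGGCA", "TTTTTT", "ATGGCCA"]
kept = [e for e in predicted if accepts(lang, e)]
print(chosen, kept)
```

Filtering predictions this way can only remove candidates, so it trades some sensitivity for specificity, which matches the direction of the improvement the abstract reports.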
On the relevance of the neurobiological analogue of the finite-state architecture
We present two simple arguments for the potential relevance of a neurobiological analogue of the finite-state architecture. The first assumes the classical cognitive framework, is well-known, and is based on the assumption that the brain is finite with respect to its memory organization. The second is formulated within a general dynamical systems framework and is based on the assumption that the brain sustains some level of noise and/or does not utilize infinite precision processing. We briefly review the classical cognitive framework based on Church-Turing computability and non-classical approaches based on analog processing in dynamical systems. We conclude that the dynamical neurobiological analogue of the finite-state architecture appears to be relevant, at least at an implementational level, for cognitive brain systems.
An exploration of language identification techniques for the Dutch folktale database
The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages including (Middle and 17th century) Dutch, Frisian and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that, in comparison to typical language identification tasks, classification performance for highly similar languages with little training data is low. The studied dataset, consisting of over 39,000 documents in 16 languages and dialects, is available on request for follow-up research.
Statistical Learning of Arbitrary Computable Classifiers
Statistical learning theory chiefly studies restricted hypothesis classes,
particularly those with finite Vapnik-Chervonenkis (VC) dimension. The
fundamental quantity of interest is the sample complexity: the number of
samples required to learn to a specified level of accuracy. Here we consider
learning over the set of all computable labeling functions. Since the
VC-dimension is infinite and a priori (uniform) bounds on the number of samples
are impossible, we let the learning algorithm decide when it has seen
sufficient samples to have learned. We first show that learning in this setting
is indeed possible, and develop a learning algorithm. We then show, however,
that bounding sample complexity independently of the distribution is
impossible. Notably, this impossibility is entirely due to the requirement that
the learning algorithm be computable, and not due to the statistical nature of
the problem.
PAC Learning, VC Dimension, and the Arithmetic Hierarchy
We compute that the index set of PAC-learnable concept classes is
$m$-complete $\Sigma^0_3$ within the set of indices for all concept classes of
a reasonable form. All concept classes considered are computable enumerations
of computable classes, in a sense made precise here. This family of
concept classes is sufficient to cover all standard examples, and also has the
property that PAC learnability is equivalent to finite VC dimension.
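To make the notion of VC dimension used here concrete, the following sketch computes it by brute force for a finite concept class over a small finite domain. It is only an illustration: the concept classes in the abstract are effectively presented and generally infinite, so nothing like this exhaustive search applies to them directly. The function names and the two toy classes (thresholds and intervals over the integers 0 to 5) are assumptions made for the example.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Concept = Callable[[int], bool]


def shatters(concepts: Sequence[Concept], points: Sequence[int]) -> bool:
    """True if the concepts realize all 2^|points| labelings of the points."""
    labelings = {tuple(c(x) for x in points) for c in concepts}
    return len(labelings) == 2 ** len(points)


def vc_dimension(concepts: Sequence[Concept], domain: Sequence[int]) -> int:
    """Largest d such that some d-point subset of the domain is shattered."""
    d = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(concepts, pts) for pts in combinations(domain, size)):
            d = size
        else:
            break  # subsets of shattered sets are shattered, so stop here
    return d


domain: List[int] = list(range(6))

# Thresholds x >= t: a single point can be labeled both ways, a pair cannot.
thresholds: List[Concept] = [lambda x, t=t: x >= t for t in range(7)]

# Closed intervals [a, b]: any pair is shattered, but no triple is,
# because the labeling (True, False, True) is never realized.
intervals: List[Concept] = [
    lambda x, a=a, b=b: a <= x <= b
    for a in range(6) for b in range(a, 6)
]

print(vc_dimension(thresholds, domain))  # 1
print(vc_dimension(intervals, domain))   # 2
```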