105 research outputs found

    Efficient learning of context-free grammars from positive structural examples

    In this paper, we introduce a new normal form for context-free grammars, called reversible context-free grammars, for the problem of learning context-free grammars from positive-only examples. A context-free grammar G = (N, Σ, P, S) is said to be reversible if (1) A → α and B → α in P implies A = B and (2) A → αBβ and A → αCβ in P implies B = C. We show that the class of reversible context-free grammars can be identified in the limit from positive samples of structural descriptions, and that there exists an efficient algorithm to identify them from such samples, where a structural description of a context-free grammar is an unlabelled derivation tree of the grammar. This implies that if positive structural examples of a reversible context-free grammar for the target language are available to the learning algorithm, the full class of context-free languages can be learned efficiently from positive samples.
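
    The two reversibility conditions stated in this abstract can be checked mechanically. The following is a minimal sketch, not the paper's learning algorithm: the function name `is_reversible`, the encoding of productions as `(lhs, rhs_tuple)` pairs, and the choice to treat every left-hand-side symbol as a nonterminal are all assumptions made for illustration.

```python
def is_reversible(productions):
    """Check the two reversibility conditions on a set of productions.

    productions: iterable of (lhs, rhs), where rhs is a tuple of symbols.
    Assumption: the nonterminals are exactly the symbols that occur as a
    left-hand side.
    """
    productions = set(productions)
    nonterminals = {lhs for lhs, _ in productions}

    # Condition (1): A -> alpha and B -> alpha implies A = B.
    lhs_by_rhs = {}
    for lhs, rhs in productions:
        if lhs_by_rhs.setdefault(rhs, lhs) != lhs:
            return False

    # Condition (2): A -> alpha B beta and A -> alpha C beta implies B = C.
    # Key each nonterminal occurrence by (A, alpha, beta) and require that
    # the nonterminal filling that slot is unique.
    slot_filler = {}
    for lhs, rhs in productions:
        for i, symbol in enumerate(rhs):
            if symbol not in nonterminals:
                continue
            key = (lhs, rhs[:i], rhs[i + 1:])
            if slot_filler.setdefault(key, symbol) != symbol:
                return False
    return True


# S -> a S b | a b satisfies both conditions.
balanced = [("S", ("a", "S", "b")), ("S", ("a", "b"))]
# S -> a and A -> a share a right-hand side, violating condition (1).
shared_rhs = [("S", ("a",)), ("A", ("a",))]
# S -> a A and S -> a B differ only in one nonterminal, violating condition (2).
shared_context = [("S", ("a", "A")), ("S", ("a", "B")),
                  ("A", ("b",)), ("B", ("c",))]
```

    Under these assumptions, `is_reversible(balanced)` holds, while the other two grammars each fail one condition.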

    Learning context-free grammars from structural data in polynomial time

    We consider the problem of learning a context-free grammar from its structural descriptions. Structural descriptions of a context-free grammar are unlabelled derivation trees of the grammar. We present an efficient algorithm for learning context-free grammars using two types of queries: structural equivalence queries and structural membership queries. The learning protocol is based on what is called a “minimally adequate teacher”, and it is shown that a grammar learned by the algorithm is not only a correct grammar, i.e. equivalent to the unknown grammar, but also structurally equivalent to it. Furthermore, the algorithm runs in time polynomial in the number of states of the minimum frontier-to-root tree automaton for the set of structural descriptions of the unknown grammar and the maximum size of any counter-example returned by a structural equivalence query.

    Improved Measurements of RNA Structure Conservation with Generalized Centroid Estimators

    Identification of non-protein-coding RNAs (ncRNAs) in genomes is a crucial task not only for molecular cell biology but also for bioinformatics. Secondary structures of ncRNAs are employed as a key feature in ncRNA analysis, since the biological functions of ncRNAs are closely related to their secondary structures. Although the minimum free energy (MFE) structure of an RNA sequence is regarded as the most stable structure, MFE alone is not an appropriate measure for identifying ncRNAs, since the free energy is heavily biased by the nucleotide composition. Therefore, instead of MFE itself, several alternative measures for identifying ncRNAs have been proposed, such as the structure conservation index (SCI) and the base pair distance (BPD), both of which employ MFE structures. However, these measurements are not suitable for identifying ncRNAs in some cases, including genome-wide search, and incur a high false discovery rate. In this study, we propose improved measurements based on SCI and BPD, applying generalized centroid estimators to provide robustness against low-quality multiple alignments. Our experiments show that our proposed methods achieve higher accuracy than the original SCI and BPD, not only for human-curated structural alignments but also for low-quality alignments produced by CLUSTAL W. Furthermore, the centroid-based SCI on CLUSTAL W alignments is more accurate than or comparable with the original SCI on structural alignments generated with RAF, a high-quality structural aligner that requires about twice the computation time on average. We conclude that our methods are more suitable than the original SCI and BPD for genome-wide alignments, which are of low quality from the point of view of secondary structure.
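
    As background for the BPD measure mentioned in this abstract, the sketch below computes the usual pairwise base-pair distance between two secondary structures: the number of base pairs present in one structure but not the other. It assumes structures are given in dot-bracket notation without pseudoknots; the function names are hypothetical, and the exact BPD-based measure and the centroid-based variants used in the paper may differ from this.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def base_pair_distance(structure_a, structure_b):
    """Count base pairs that occur in exactly one of the two structures."""
    return len(base_pairs(structure_a) ^ base_pairs(structure_b))
```

    For example, `base_pair_distance("((..))", ".(..).")` compares a two-pair hairpin with a one-pair hairpin that shares its inner pair, so the structures differ by exactly one base pair.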

    On case-based learnability of languages

    Case-based reasoning is deemed an important technology to alleviate the bottleneck of knowledge acquisition in Artificial Intelligence (AI). In case-based reasoning, knowledge is represented in the form of particular cases with an appropriate similarity measure rather than any form of rules. The case-based reasoning paradigm adopts the view that an AI system is dynamically changing during its life-cycle, which immediately leads to learning considerations. Within the present paper, we investigate the problem of case-based learning of indexable classes of formal languages. Prior to learning considerations, we study the problem of case-based representability and show that every indexable class is case-based representable with respect to a fixed similarity measure. Next, we investigate several models of case-based learning and systematically analyze their strengths as well as their limitations. Finally, the general approach to case-based learnability of indexable classes of formal languages is prototypically applied to so-called containment decision lists, since they seem particularly tailored to case-based knowledge processing.

    Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data

    Background: Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and functions as a starter for the production of the traditional Japanese food "natto" made from soybeans. Although re-sequencing whole genomes of several laboratory domesticated B. subtilis 168 derivatives has already been attempted using short read sequencing data, the assembly of the whole genome sequence of a closely related strain, B. subtilis natto, from very short read data is more challenging, particularly with our aim to assemble one fully connected scaffold from short reads around 35 bp in length. Results: We applied a comparative genome assembly method, which combines de novo assembly and reference guided assembly, to one of the B. subtilis natto strains. We successfully assembled 28 scaffolds and managed to avoid substantial fragmentation. Completion of the assembly through long PCR experiments resulted in one connected scaffold for B. subtilis natto. Based on the assembled genome sequence, our orthologous gene analysis between natto BEST195 and Marburg 168 revealed that 82.4% of 4375 predicted genes in BEST195 are one-to-one orthologous to genes in 168, with two genes in-paralog, 3.2% are deleted in 168, 14.3% are inserted in BEST195, and 5.9% of genes present in 168 are deleted in BEST195. The natto genome contains the same alleles in the promoter region of degQ and the coding region of swrAA as the wild strain, RO-FF-1. These are specific for γ-PGA production ability, which is related to natto production. Further, the B. subtilis natto strain completely lacks a polyketide synthesis operon, has a disrupted plipastatin production operon, and possesses previously unidentified transposases. Conclusions: The determination of the whole genome sequence of Bacillus subtilis natto provided detailed analyses of a set of genes related to natto production, demonstrating the number and locations of insertion sequences that B. subtilis natto harbors but B. subtilis 168 lacks. Multiple genome-level comparisons among five closely related Bacillus species were also carried out. The determined genome sequence of B. subtilis natto and gene annotations are available from the Natto genome browser http://natto-genome.org/.

    Construction of a genetic AND gate under a new standard for assembly of genetic parts

    Background: Appropriate regulation of respective gene expressions is a bottleneck for the realization of artificial biological systems inside living cells. The modification of several promoter sequences is required to achieve appropriate regulation of the systems. However, a time-consuming process is required for the insertion of an operator, a binding site of a protein for gene expression, to the gene regulatory region of a plasmid. Thus, a standardized method for integrating operator sequences to the regulatory region of a plasmid is required. Results: We developed a standardized method for integrating operator sequences to the regulatory region of a plasmid and constructed a synthetic promoter that functions as a genetic AND gate. By standardizing the regulatory region of a plasmid and the operator parts, we established a platform for modular assembly of the operator parts. Moreover, by assembling two different operator parts on the regulatory region, we constructed a regulatory device with an AND gate function. Conclusions: We implemented a new standard to assemble operator parts for construction of functional genetic logic gates. The logic gates at the molecular scale have important implications for reprogramming cellular behavior.

    The Dugesia ryukyuensis Database as a Molecular Resource for Studying Switching of the Reproductive System

    The planarian Dugesia ryukyuensis reproduces both asexually and sexually, and can switch from one mode of reproduction to the other. We recently developed a method for experimentally switching reproduction of the planarian from the asexual to the sexual mode. We constructed a cDNA library from sexualized D. ryukyuensis and sequenced and analyzed 8,988 expressed sequence tags (ESTs). The ESTs were analyzed and grouped into 3,077 non-redundant sequences, leaving 1,929 singletons that formed the basis of unigene sets. Fifty-six percent of the cDNAs analyzed shared similarity (E-value < 1E-20) with sequences deposited in NCBI. Highly redundant sequences encoded granulin and actin, which are expressed in the whole body, and other redundant sequences encoded a Vasa-like protein, which is known to be a component of germ-line cells and is expressed in the ovary, and Y-protein, which is expressed in the testis. The sexualized planarian expressed sequence tag database (http://planaria.bio.keio.ac.jp/planaria/) is an open-access, online resource providing access to sequence, classification, clustering, and annotation data. This database should constitute a powerful tool for analyzing sexualization in planarians.

    Directed acyclic graph kernels for structural RNA analysis

    Background: Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs) have been reported by numerous researchers. In order to analyze ncRNAs by kernel methods including support vector machines, we propose stem kernels as an extension of string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures. However, applying stem kernels directly to large data sets of ncRNAs is impractical due to their computational complexity. Results: We have developed a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences that significantly increases the computation speed of stem kernels. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences. Our kernels outperformed the existing methods with respect to the detection of known ncRNAs and kernel hierarchical clustering. Conclusion: Stem kernels can be utilized as a reliable similarity measure of structural RNAs, and can be used in various kernel-based applications.