8 research outputs found

    Context-driven discovery of gene cassettes in mobile integrons using a computational grammar

    Abstract. Background: Gene discovery algorithms typically examine sequence data for low level patterns. A novel method to computationally discover higher order DNA structures is presented, using a context sensitive grammar. The algorithm was applied to the discovery of gene cassettes associated with integrons. The discovery and annotation of antibiotic resistance genes in such cassettes is essential for effective monitoring of antibiotic resistance patterns and formulation of public health antibiotic prescription policies. Results: We discovered two new putative gene cassettes using the method, from 276 integron features and 978 GenBank sequences. The system achieved κ = 0.972 annotation agreement with an expert gold standard of 300 sequences. In rediscovery experiments, we deleted 789,196 cassette instances over 2030 experiments and correctly relabelled 85.6% (α ≥ 95%, E ≤ 1%, mean sensitivity = 0.86, specificity = 1, F-score = 0.93), with no false positives. Error analysis demonstrated that for 72,338 missed deletions, two adjacent deleted cassettes were labeled as a single cassette, increasing performance to 94.8% (mean sensitivity = 0.92, specificity = 1, F-score = 0.96). Conclusion: Using grammars we were able to represent heuristic background knowledge about large and complex structures in DNA. Importantly, we were also able to use the context embedded in the model to discover new putative antibiotic resistance gene cassettes. The method is complementary to existing automatic annotation systems which operate at the sequence level.

    Triad pattern algorithm for predicting strong promoter candidates in bacterial genomes

    Abstract. Background: Bacterial promoters, which increase the efficiency of gene expression, differ from other promoters by several characteristics. This difference, not yet widely exploited in bioinformatics, looks promising for the development of relevant computational tools to search for strong promoters in bacterial genomes. Results: We describe a new triad pattern algorithm that predicts strong promoter candidates in annotated bacterial genomes by matching specific patterns for the group I σ70 factors of Escherichia coli RNA polymerase. It detects promoter-specific motifs by consecutively matching three patterns, consisting of an UP-element, required for interaction with the α subunit, and then optimally-separated patterns of -35 and -10 boxes, required for interaction with the σ70 subunit of RNA polymerase. Analysis of 43 bacterial genomes revealed that the frequency of candidate sequences depends on the A+T content of the DNA under examination. The accuracy of in silico prediction was experimentally validated for the genome of a hyperthermophilic bacterium, Thermotoga maritima, by applying a cell-free expression assay using the predicted strong promoters. In this organism, the strong promoters govern genes for translation, energy metabolism, transport, cell movement, and other as-yet unidentified functions. Conclusion: The triad pattern algorithm developed for predicting strong bacterial promoters is well suited for analyzing bacterial genomes with an A+T content of less than 62%. This computational tool opens new prospects for investigating global gene expression, and individual strong promoters in bacteria of medical and/or economic significance.
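The consecutive three-pattern match described in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual patterns, so everything concrete here is an assumption made for illustration: the A+T-rich stand-in for the UP-element, the textbook σ70 consensus boxes (TTGACA and TATAAT), the 16 to 18 bp spacer window around the textbook 17 bp optimum, the short gap allowed between the UP-element and the -35 box, and the function name `find_candidates`.

```python
import re

# All patterns below are illustrative assumptions, not the paper's patterns.
UP_ELEMENT = r"[AT]{8,}"       # stand-in for an A+T-rich UP-element
BOX_35 = r"TTGACA"             # textbook sigma70 -35 consensus
SPACER = r"[ACGT]{16,18}"      # window around the textbook 17 bp spacing
BOX_10 = r"TATAAT"             # textbook sigma70 -10 consensus

# The three patterns must match consecutively: UP-element, then -35, then -10.
TRIAD = re.compile(
    rf"(?P<up>{UP_ELEMENT})[ACGT]{{0,4}}(?P<m35>{BOX_35}){SPACER}(?P<m10>{BOX_10})"
)


def find_candidates(seq):
    """Return (UP start, -35 start, -10 start) for each triad match in seq."""
    seq = seq.upper()
    return [
        (m.start("up"), m.start("m35"), m.start("m10"))
        for m in TRIAD.finditer(seq)
    ]


if __name__ == "__main__":
    # Synthetic sequence built to contain exactly one triad.
    demo = "GC" + "AAAATTTTAA" + "GC" + "TTGACA" + "GCTAGCTAGCTAGCTAG" + "TATAAT" + "GC"
    print(find_candidates(demo))
```

Exact-consensus matching is far stricter than a real promoter search would need to be; a practical version would presumably tolerate mismatches in each box and score candidates rather than accept or reject them outright.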

    A sequence-length sensitive approach to learning biological grammars using inductive logic programming.

    This thesis aims to investigate whether the ideas behind compression principles, such as the Minimum Description Length, can help us to improve the process of learning biological grammars from protein sequences using Inductive Logic Programming (ILP). Contrary to most traditional ILP learning problems, biological sequences often have a high variation in their length. This variation in length is an important feature of biological sequences which should not be ignored by ILP systems. However, we have identified that some ILP systems do not take into account the length of examples when evaluating their proposed hypotheses. During the learning process, many ILP systems use clause evaluation functions to assign a score to induced hypotheses, estimating their quality and effectively influencing the search. Traditionally, clause evaluation functions do not take into account the length of the examples which are covered by the clause. We propose L-modification, a way of modifying existing clause evaluation functions so that they take into account the length of the examples which they learn from. An empirical study was undertaken to investigate whether significant improvements can be achieved by applying L-modification to a standard clause evaluation function. Furthermore, we generally investigated how ILP systems cope with the length of examples in training data. We show that our L-modified clause evaluation function outperforms our benchmark function in every experiment we conducted and thus we prove that L-modification is a useful concept. We also show that the length of the examples in the training data used by ILP systems does have an undeniable impact on the results.

    Computational synthesis for scientific experimentation


    Biochemistry students' difficulties with the symbolic and visual language used in molecular biology.

    Thesis (Ph.D.)-University of KwaZulu-Natal, Pietermaritzburg, 2007. This study reports on recurring difficulties experienced by undergraduate students with respect to understanding and interpretation of certain symbolism, nomenclature, terminology, shorthand notation, models and other visual representations employed in the field of Molecular Biology to communicate information. Based on teaching experience and guidelines set out by a four-level methodological framework, data on various topic-related difficulties were obtained by inductive analyses of students’ written responses to specifically designed, free-response and focused probes. In addition, interviews, think-aloud exercises and student-generated diagrams were used to collect information. Both unanticipated and recurring difficulties were compared with scientifically correct propositional knowledge, categorized and subsequently classified. Students were adept at providing the meaning of the symbol “Δ” in various scientific contexts; however, some failed to recognize its use to depict the deletion of a leucine biosynthesis gene in the form, Δ leu. “Hazard to leucine”, “change to leucine” and “abbreviation for isoleucine” were some of the erroneous interpretations of this polysemic symbol. Investigations on these definitions suggest a constructivist approach to knowledge construction and the inappropriate transfer of knowledge from prior mental schemata. The symbol, “::”, was poorly differentiated by students in its use to indicate gene integration or transposition and in tandem gene fusion. Idiosyncratic perceptions emerged suggesting that it is, for example, a proteinaceous component linking genes in a chromosome or the centromere itself associated with the mitotic spindle or “electrons” between genes in the same way that it is symbolically shown in Lewis dot diagrams which illustrate covalent bonding between atoms.
    In an oligonucleotide shorthand notation, some students used valency to differentiate the phosphite trivalent form of the phosphorus atom from the pentavalent phosphodiester group, yet the concept of valency was poorly understood. By virtue of the visual form of a shorthand notation of the 3′,5′ phosphodiester link in DNA, the valency was incorrectly read. VSEPR theory and the Octet Rule were misunderstood or forgotten when trying to explain the valency of the phosphorus atom in synthetic oligonucleotide intermediates. Plasmid functional domains were generally well-understood although restriction mapping appeared to be a cognitively demanding task. Rote learning and substitution of definitions were evident in the explanation of promoter and operator functions. The concept of gene expression posed difficulties to many students who believed that genes contain the entity they encode. Transcription and translation of in tandem gene fusions were poorly explained by some students as was the effect of plasmid conformation on transformation and gene expression. With regard to the selection of transformants or the hybridoma, some students could not engage in reasoning or lateral thinking as protoconcepts and domain-specific information were poorly understood. A failure to integrate and reason with factual information on phenotypic traits, media components and biochemical pathways was evident in written and oral presentations. DNA-strand nomenclature and associated function were problematic to some students as they failed to differentiate coding strand from template strand and were prone to interchange the labelling of these. A substitution of labels with those characterizing DNA replication intermediates demonstrated erroneous information transfer. DNA replication models posed difficulties integrating molecular mechanisms and detail with line drawings, coupled with inaccurate illustrations of sequential replication features.
    Finally, a remediation model is presented, demonstrating a shift in assessment score dispersion from a range of 0 - 4.5 to 4 - 9 when learners are guided metacognitively to work with domain-specific or critical knowledge from an information bank. The present work shows that varied forms of symbolism can present students with complex learning difficulties as the underlying information depicted by these is understood in a superficial way. It is imperative that future studies be focused on the standardization of symbol use, perhaps governed by convention that determines the manner in which threshold information is disseminated on symbol use, coupled with innovative teaching strategies which facilitate an improved understanding of the use of symbolic representations in Molecular Biology. As Molecular Biology advances, it is likely that experts will continue to use new and diverse forms of symbolic representations to explain their findings. The explanation of futuristic Science is likely to develop a symbolic language that will impose great teaching challenges and unimaginable learning difficulties to new generation teachers and learners, respectively.