12 research outputs found

    Needed for completion of the human genome: hypothesis driven experiments and biologically realistic mathematical models

    With the sponsorship of "Fundacio La Caixa" we met in Barcelona, November 21st and 22nd, to analyze the reasons why, after the completion of the human genome sequence, the identification of all protein coding genes and their variants remains a distant goal. Here we report on our discussions and summarize some of the major challenges that need to be overcome in order to complete the human gene catalog. Comment: Report and discussion resulting from the "Fundacio La Caixa" gene finding meeting held November 21 and 22, 2003 in Barcelona.

    Computational approaches to gene finding

    Thesis (M.Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000. Includes bibliographical references (leaf 37). By Eric Banks. M.Eng.

    The DAWGPAWS pipeline for the annotation of genes and transposable elements in plant genomes

    Background: High quality annotation of the genes and transposable elements in complex genomes requires a human-curated integration of multiple sources of computational evidence. This evidence includes results from a variety of ab initio prediction programs as well as homology-based searches. Most of these programs operate on a single contiguous sequence at a time, and the results are generated in a diverse array of readable formats that must be translated to a standardized file format. These translated results must then be concatenated into a single source and presented in an integrated form for human curation. Results: We have designed, implemented, and assessed a Perl-based workflow named DAWGPAWS for the generation of computational results for human curation of the genes and transposable elements in plant genomes. The use of DAWGPAWS was found to accelerate annotation of 80–200 kb wheat DNA inserts in bacterial artificial chromosome (BAC) vectors by approximately twenty-fold and also to significantly improve the quality of the annotation in terms of completeness and accuracy. Conclusion: The DAWGPAWS genome annotation pipeline fills an important need in the annotation of plant genomes by generating computational evidence in a high throughput manner, translating these results to a common file format, and facilitating the human curation of these computational results. We have verified the value of DAWGPAWS by using this pipeline to annotate the genes and transposable elements in 220 BAC insertions from the hexaploid wheat genome (Triticum aestivum L.). DAWGPAWS can be applied to annotation efforts in other plant genomes with minor modifications of program-specific configuration files, and the modular design of the workflow facilitates integration into existing pipelines.

    Genome size evolution in pufferfish: an insight from BAC clone-based Diodon holocanthus genome sequencing

    Background: Variations in genome size within and between species have been observed since the 1950s in diverse taxonomic groups. Serving as model organisms, smooth pufferfish possess the smallest vertebrate genomes. Interestingly, spiny pufferfish from the sister family have genomes twice as large as those of smooth pufferfish. Therefore, comparative genomic analysis between smooth pufferfish and spiny pufferfish is useful for our understanding of genome size evolution in pufferfish. Results: Ten BAC clones of a spiny pufferfish, Diodon holocanthus, were randomly selected and shotgun sequenced. In total, 776 kb of non-redundant sequences without gaps, representing 0.1% of the D. holocanthus genome, were identified, and 77 distinct genes were predicted. In the sequenced D. holocanthus genome, 364 kb is homologous with 265 kb of the Takifugu rubripes genome, and 223 kb is homologous with 148 kb of the Tetraodon nigroviridis genome. Repetitive DNA accounts for 8% of the sequenced D. holocanthus genome, which is higher than in the T. rubripes genome (6.89%) and in the Te. nigroviridis genome (4.66%). Of the repetitive DNA, 76% is retroelements, which account for 6% of the sequenced D. holocanthus genome and belong to known families of transposable elements. More than half of the retroelements were distributed within genes. In the non-homologous regions, the repeat element proportion in the D. holocanthus genome increased to 10.6% compared with T. rubripes and to 9.19% compared with Te. nigroviridis. A comparison of 10 well-defined orthologous genes showed that the average intron size in the D. holocanthus genome (566 bp) is significantly longer than that in the smooth pufferfish genome (435 bp). Conclusion: Compared with the smooth pufferfish, D. holocanthus has a genome with low gene density that is rich in repeat elements. Genome size variation between D. holocanthus and the smooth pufferfish appears as length variation between homologous regions and as different accumulation of non-homologous sequences. The difference in intron length is consistent with the genome size variation between D. holocanthus and the smooth pufferfish. Different transposable element accumulation is responsible for genome size variation between D. holocanthus and the smooth pufferfish.

    Method of predicting Splice Sites based on signal interactions

    BACKGROUND: Prediction and proper ranking of canonical splice sites (SSs) is a challenging problem in the bioinformatics and machine learning communities. Any progress in SS recognition will lead to a better understanding of the splicing mechanism. We introduce several new approaches to combining a priori knowledge for improved SS detection. First, we design our new Bayesian SS sensor based on oligonucleotide counting. To further enhance prediction quality, we applied our new de novo motif detection tool MHMMotif to intronic ends and exons. We combine the elements found with sensor information using a Naive Bayesian Network, as implemented in our new tool SpliceScan. RESULTS: According to our tests, the Bayesian sensor outperforms the contemporary Maximum Entropy sensor for 5' SS detection. We report a number of putative Exonic (ESE) and Intronic (ISE) Splicing Enhancers found by the MHMMotif tool. T-test statistics on mouse/rat intronic alignments indicate that the detected elements are on average more conserved than other oligos, which supports our assumption of their functional importance. The tool has been shown to outperform the SpliceView, GeneSplicer, NNSplice, Genio and NetUTR tools on the test set of human genes. SpliceScan outperforms all contemporary ab initio gene structural prediction tools on the set of 5' UTR gene fragments. CONCLUSION: The designed methods have many attractive properties compared to existing approaches. The Bayesian sensor, the MHMMotif program and the SpliceScan tool are freely available on our web site. REVIEWERS: This article was reviewed by Manyuan Long, Arcady Mushegian and Mikhail Gelfand.

    Computational methods in Bioinformatics: Introduction, Review, and Challenges

    Biotechnology is emerging as a new driving force for the global economy in the 21st century. An important engine for biotechnology is Bioinformatics. Bioinformatics has revolutionized biology research and drug discovery. Bioinformatics is an amalgamation of biological sciences, computer science, applied math, and systems science. The report provides a brief introduction to molecular biology for non-biologists, with a focus on understanding the basic biological problems that triggered the exponentially growing research activities in the bioinformatics fields. The report also provides a comprehensive literature review of the main challenging problems and the current tools and algorithms. In particular, the problems of gene modeling and gene prediction, similarity search, multiple alignment of proteins, and protein folding are highlighted. The report also discusses how such tools as dynamic programming, hidden Markov models, statistical analysis, clustering, decision trees, fuzzy theory, and neural networks have been applied in solving these problems.

    Development of gene-finding algorithms for fungal genomes: dealing with small datasets and leveraging comparative genomics

    Thesis (M.Eng. and S.B.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (leaves 60-62). A computer program called FUNSCAN was developed which identifies protein coding regions in fungal genomes. Gene structural and compositional properties are modeled using a Hidden Markov Model. Separate training and testing sets for FUNSCAN were obtained by aligning cDNAs from an organism to their genomic loci, generating a 'gold standard' set of annotated genes. The performance of FUNSCAN is competitive with other computer programs designed to identify protein coding regions in fungal genomes. A technique called 'Training Set Augmentation' is described which can be used to train FUNSCAN when only a small training set of genes is available. Techniques that combine alignment algorithms with FUNSCAN to identify novel genes are also discussed and explored. By Allan Lazarovici. M.Eng. and S.B.

    Kompression von DNA Sequenzen (Compression of DNA Sequences)

    Standard compression methods achieve very high compression rates on texts that contain program code or human language. DNA sequences have a linear order of bases and can therefore also be treated as texts. In most cases, however, standard compressors fail to compress DNA sequences. Even when a DNA sequence can be compressed, the compression rate is extremely low. There are two possible explanations for this observation: either DNA sequences are generally incompressible, or existing compressors are not capable of compressing them. This diploma thesis investigates that question. It turns out that DNA sequences can indeed be compressed if a compressor exploits their characteristic properties. Essential differences between human language and DNA sequences are explained, and DNA-specific compression methods are presented. Compression can be used not only to reduce the storage required; there are also many areas of application in biology where compression methods can be employed. For example, phylogenetic trees can be reconstructed from genetic data with the help of compression methods. A compressor can also be used to characterize different regions of the genome. The second part of the thesis examines these areas in more detail. Practical experiments are carried out to investigate usefulness and applicability more closely, and also to show the limits of the methods.

    The human solute carrier 26 family of anion exchangers
