29,043 research outputs found

    Towards a Taxonomically Intelligent Phylogenetic Database

    This note outlines some of the key intellectual obstacles that stand in the way of creating a usable phylogenetic database. These challenges include the need to accommodate multiple taxonomic names and classifications, and the need for tools to query trees in biologically meaningful ways. Until these problems are addressed, and a taxonomically intelligent phylogenetic database created, much of our phylogenetic knowledge will languish in the pages of journals.
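As a rough illustration of what a name-aware tree query might look like, the sketch below resolves alternative taxon names through a synonym table before looking up a most recent common ancestor. The tree, the synonym entries, and the function names `accepted_name`, `lineage`, and `mrca` are all hypothetical toy choices made for this example, not part of the note or of any existing database.

```python
# Hypothetical toy tree stored as child -> parent links (None marks the root).
PARENT = {
    "Homo sapiens": "Hominini",
    "Pan troglodytes": "Hominini",
    "Hominini": "Homininae",
    "Gorilla gorilla": "Homininae",
    "Homininae": None,
}

# Illustrative synonym table (assumed entries, not a curated synonymy).
SYNONYMS = {
    "Pan satyrus": "Pan troglodytes",
    "Troglodytes gorilla": "Gorilla gorilla",
}


def accepted_name(name):
    """Map a synonym to the name used in the tree; other names pass through."""
    return SYNONYMS.get(name, name)


def lineage(name):
    """Return the path from a taxon up to the root, starting at the taxon."""
    node = accepted_name(name)
    if node not in PARENT:
        raise KeyError(f"unknown taxon: {name}")
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def mrca(name_a, name_b):
    """Most recent common ancestor of two taxa, queried by any known name."""
    ancestors_a = set(lineage(name_a))
    for node in lineage(name_b):
        if node in ancestors_a:
            return node
    raise ValueError("taxa do not share a root")
```

Calling `mrca("Homo sapiens", "Pan satyrus")` resolves the second name through the synonym table before walking the tree, so the query works whichever name the user supplies.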

    The Dawn of Open Access to Phylogenetic Data

    The scientific enterprise depends critically on the preservation of and open access to published data. This basic tenet applies acutely to phylogenies (estimates of evolutionary relationships among species). Phylogenies are increasingly estimated from large, genome-scale datasets using complex statistical methods that demand growing expertise and computational investment. Moreover, the resulting phylogenetic data provide an explicit historical perspective that critically informs research in a vast and growing number of scientific disciplines. One such use is the study of changes in rates of lineage diversification (speciation minus extinction) through time. As part of a meta-analysis in this area, we sought to collect phylogenetic data (comprising nucleotide sequence alignment and tree files) from 217 studies published in 46 journals over a 13-year period. We document our attempts to procure those data (from online archives and by direct request to corresponding authors), and report results of analyses (using Bayesian logistic regression) to assess the impact of various factors on the success of our efforts. Overall, complete phylogenetic data for ~60% of these studies are effectively lost to science. Our study indicates that phylogenetic data are more likely to be deposited in online archives and/or shared upon request when: (1) the publishing journal has a strong data-sharing policy; (2) the publishing journal has a higher impact factor; and (3) the data are requested from faculty rather than students. Although the situation appears dire, our analyses suggest that it is far from hopeless: recent initiatives by the scientific community, including policy changes by journals and funding agencies, are improving the state of affairs.

    Exploring deep phylogenies using protein structure : a dissertation submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Biochemistry, Institute of Natural and Mathematical Sciences, Massey University, Auckland, New Zealand

    Recent times have seen an exponential growth in protein sequence and structure data. The most popular way of characterising newly determined protein sequences is to compare them to well characterised sequences and predict the function of novel sequences based on homology. This practice has been highly successful for a majority of proteins. However, these sequence based methods struggle with certain deeply diverging proteins and hence cannot always recover evolutionary histories. Another feature of proteins, namely their structures, has been shown to retain evolutionary signals over longer time scales compared to the respective sequences that encode them. The structure therefore presents an opportunity to uncover the evolutionary signal that otherwise escapes conventional sequence-based methods. Structural phylogenetics refers to the comparison of protein structures to extract evolutionary relationships. The area of structural phylogenetics has been around for a number of years and multiple approaches exist to delineate evolutionary relationships from protein structures. However, once the relationships have been recovered from protein structural data, no methods exist, at present, to verify the robustness of these relationships. Because of the nature of the structural data, conventional sequence-based methods, e.g. bootstrapping, cannot be applied. This work introduces the first ever use of a molecular dynamics (MD)-based bootstrap method, which can add a measure of significance to the relationships inferred from the structure-based analysis. This work begins in Chapter 2 by thoroughly investigating the use of a protein structural comparison metric Qscore, which has previously been used to generate structural phylogenies, and highlights its strengths and weaknesses. 
The mechanistic exploration of the structural comparison metric reveals that the protein structures being compared should differ in size by no more than 5-10% for accurate phylogenetic inference. Chapter 2 also explores the MD-based bootstrap method to offer an interpretation of the significance values recovered. Two protein structural datasets, one more conserved at the sequence level than the other and with different levels of structural conservation, are used as controls to simplify the interpretation of the statistics recovered from the MD-based bootstrap method. Chapter 3 then applies the Qscore metric to the aminoacyl-tRNA synthetases. The aminoacyl-tRNA synthetases are believed to have been present at the dawn of life, making them one of the most ancient protein families. Because of their important functional role, these proteins are conserved at both the sequence and structural levels and are well characterised by both sequence- and structure-based comparative methods. This family therefore offered established inferences against which an automated structural analysis could be compared. Successful recovery of known relationships raised confidence in the ability of structural phylogenetic analysis based on Qscore to detect evolutionary signals. In Chapter 4, a structural phylogeny was created for a protein structural dataset presenting either the histone fold or its ancestral precursor. This dataset comprised proteins that had diverged significantly at the sequence level but shared a common structural motif. The structural phylogeny recovered the split between bacterial and non-bacterial proteins. Furthermore, TATA protein associated factors were found to have multiple points of origin. Moreover, the SCOP and Pfam classifications of these proteins did not fully match each other, and neither agreed with the results from this work.
Using the structural phylogeny, a model outlining the evolution of these proteins was proposed. The structural phylogeny of the Ferritin-like superfamily has previously been generated using the Qscore metric and supported qualitatively. Chapter 5 recovers the structural phylogeny of the Ferritin-like superfamily and finds quantitative support for the inferred relationships from the first ever implementation of the MD-based bootstrap method. The use of the MD-based bootstrap method simultaneously allows for the resolution of polytomies in structural databases. Some limitations of the MD-based bootstrap method, highlighted in Chapter 2, are revisited in Chapter 5. This work indicates that evolutionary signals can be successfully extracted from protein structures for deeply diverging proteins and that the MD-based bootstrap method can be used to gauge the robustness of relationships inferred.

    TRAPID : an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes

    Transcriptome analysis through next-generation sequencing technologies allows the generation of detailed gene catalogs for non-model species, at the cost of new challenges with regard to computational requirements and bioinformatics expertise. Here, we present TRAPID, an online tool for the fast and efficient processing of assembled RNA-Seq transcriptome data, developed to mitigate these challenges. TRAPID offers high-throughput open reading frame detection and frameshift correction, and includes a functional, comparative and phylogenetic toolbox that makes use of 175 reference proteomes. Benchmarking and comparison against state-of-the-art transcript analysis tools reveal the efficiency and unique features of the TRAPID system.
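For readers unfamiliar with open reading frame detection, the following is a deliberately naive sketch of the idea: scan the three forward frames of a transcript for the longest stretch that runs from an ATG to an in-frame stop codon. It is an illustration only and makes no claim about how TRAPID itself detects ORFs or corrects frameshifts; the function name `longest_orf` and the example sequence are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (frame, start, end) of the longest ATG..stop ORF on the
    forward strand, with end exclusive, or None if no complete ORF exists."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[2] - best[1]:
                    best = (frame, start, end)
                start = None  # close this ORF and look for the next ATG
    return best


# Made-up transcript fragment used only to exercise the function.
example = "CCATGAAATGGTAACC"
hit = longest_orf(example)
```

A production tool would also need to handle the reverse strand, partial ORFs at transcript ends, and alternative genetic codes, none of which this sketch attempts.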

    Comparative genomic analysis of Acinetobacter spp. plasmids originating from clinical settings and environmental habitats

    Bacteria belonging to the genus Acinetobacter have become of clinical importance over the last decade due to the development of a multi-resistant phenotype and their ability to survive under multiple environmental conditions. The development of these traits among Acinetobacter strains occurs frequently as a result of plasmid-mediated horizontal gene transfer. In this work, plasmids from nosocomial and environmental Acinetobacter spp. collections were separately sequenced and characterized. Assembly of the sequenced data resulted in 19 complete replicons in the nosocomial collection and 77 plasmid contigs in the environmental collection. Comparative genomic analysis showed that many of them had conserved backbones. Plasmid coding sequences corresponding to plasmid specific functions were bioinformatically and functionally analyzed. Replication initiation protein analysis revealed the predominance of the Rep_3 superfamily. The phylogenetic tree constructed from all Acinetobacter Rep_3 superfamily plasmids showed 16 intermingled clades originating from nosocomial and environmental habitats. Phylogenetic analysis of relaxase proteins revealed the presence of a new sub-clade named MOBQAci, composed exclusively of Acinetobacter relaxases. Functional analysis of proteins belonging to this group showed that they behaved differently when mobilized using helper plasmids belonging to different incompatibility groups.
    Affiliation: Salto, Ileana Paula. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Biotecnología y Biología Molecular. Universidad Nacional de La Plata. Facultad de Ciencias Exactas. Instituto de Biotecnología y Biología Molecular; Argentina
    Affiliation: Torres Tejerizo, Gonzalo Arturo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Biotecnología y Biología Molecular. Universidad Nacional de La Plata. Facultad de Ciencias Exactas. Instituto de Biotecnología y Biología Molecular; Argentina. Universitat Bielefeld. Center For Biotechnology; Germany
    Affiliation: Wibberg, Daniel. Universitat Bielefeld. Center For Biotechnology; Germany
    Affiliation: Pühler, Alfred. Universitat Bielefeld. Center For Biotechnology; Germany
    Affiliation: Schlüter, Andreas. Universitat Bielefeld. Center For Biotechnology; Germany
    Affiliation: Pistorio, Mariano. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Biotecnología y Biología Molecular. Universidad Nacional de La Plata. Facultad de Ciencias Exactas. Instituto de Biotecnología y Biología Molecular; Argentina

    Innovative in silico approaches to address avian flu using grid technology

    Recent years have seen the emergence of diseases that have spread very quickly around the world, either through human travel, like SARS, or through animal migration, like avian flu. Among the biggest challenges raised by emerging infectious diseases, one is the constant mutation of the viruses, which turns them into continuously moving targets for drug and vaccine discovery. Another is the early detection and surveillance of the diseases, as new cases can appear anywhere owing to the globalization of exchanges and the circulation of people and animals around the earth, as recently demonstrated by the avian flu epidemics. For 3 years now, a collaboration of teams in Europe and Asia has been exploring innovative in silico approaches to better tackle avian flu, taking advantage of the very large computing resources available on international grid infrastructures. Grids were used to study the impact of mutations on the effectiveness of existing drugs against H5N1 and to find potential new leads active on mutated strains. Grids also allow the integration of distributed data in a completely secure way. The paper presents how we are currently exploring the integration of existing data sources towards a global surveillance network for molecular epidemiology.
    Comment: 7 pages, submitted to Infectious Disorders - Drug Target

    TreeRipper: towards a fully automated optical tree recognition software

    Relationships between species, genes and genomes have been printed as trees for over a century. Whilst this may have been the best format for exchanging and sharing phylogenetic hypotheses during the 20th century, the worldwide web now provides faster and automated ways of transferring and sharing phylogenetic knowledge. However, novel software is needed to defrost these published phylogenies for the 21st century. 
TreeRipper is a command-line C++ program for the fully automated recognition of multifurcating phylogenetic trees. The program accepts a range of input image formats (PNG, JPG/JPEG, GIF, TIFF or PDF). A number of cleaning steps follow to detect lines, remove node labels, patch up broken lines and corners, and detect line edges. The edge contour is then determined to detect the branch lengths, tip label positions and the topology of the tree. Optical Character Recognition (OCR) with the freely available tesseract-ocr software is used to convert the tip labels into text. 32% of images meeting the prerequisites for TreeRipper were successfully recognised; the largest tree had 115 leaves.
Despite the diversity of ways phylogenies have been illustrated making the design of a fully automated tree recognition software difficult, TreeRipper is a step towards automating the digitization of past phylogenies. We also provide a dataset of 100 tree images and associated tree files for training and/or benchmarking future software.
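
To make the end product of such a pipeline concrete, the sketch below shows one way a recovered multifurcating tree (topology, branch lengths, and tip labels) could be written out as a tree file. It assumes the Newick format, which the abstract does not specify, and it is written in Python for brevity rather than in TreeRipper's C++; the tuple layout and the names `to_newick` and `count_leaves` are hypothetical.

```python
def to_newick(node):
    """Serialize a nested (label, branch_length, children) tuple to Newick.

    Assumed layout: tips have an empty children list, and a branch length
    of None means no length is written for that node.
    """
    label, length, children = node
    text = label
    if children:
        inner = ",".join(to_newick(child) for child in children)
        text = "(" + inner + ")" + label
    if length is not None:
        text += f":{length:g}"
    return text


def count_leaves(node):
    """Count the tips below a node (a tip counts as one)."""
    _, _, children = node
    if not children:
        return 1
    return sum(count_leaves(child) for child in children)


# Toy tree with a three-way split at the root, i.e. a multifurcation.
tree = ("", None, [
    ("A", 1.0, []),
    ("B", 1.0, []),
    ("", 0.5, [("C", 2.0, []), ("D", 2.0, [])]),
])
newick = to_newick(tree) + ";"
```

Here `newick` holds the string `(A:1,B:1,(C:2,D:2):0.5);` and `count_leaves(tree)` gives the number of tips, the same measure the abstract uses when it reports a largest tree of 115 leaves.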