1,629 research outputs found

    An approximate search engine for structure

    As the size of structural databases grows, the need for efficiently searching these databases arises. Thanks to previous and ongoing research, searching by attribute-value and by text has become commonplace in these databases. However, searching by topological or physical structure, especially for large databases and especially for approximate matches, is still an art. In this dissertation, efficient search techniques are presented for retrieving trees from a database that are similar to a given query tree. Rooted ordered labeled trees, rooted unordered labeled trees and free trees are considered. Ordered labeled trees are trees in which each node has a label and the left-to-right order among siblings matters. Unordered labeled trees are trees in which the parent-child relationship is significant, but the order among siblings is unimportant. Free trees (unrooted unordered trees) are acyclic graphs. These trees find many applications in bioinformatics, Web log analysis, phyloinformatics, XML processing, etc. Two types of similarity measures are investigated: (i) counting the mismatching paths in the query tree and a data tree, and (ii) measuring the topological relationship between the trees. The proposed approaches include storing the paths of trees in a suffix array, employing hashing techniques to speed up retrieval, and counting the number of up-down operations to move a token from one node to another node in a tree. Various filters for accelerating a search, different strategies for parallelizing these search algorithms and applications of these algorithms to XML and phylogenetic data management are discussed. The proposed techniques have been implemented into a phylogenetic search engine which is fully operational and is available on the World Wide Web. Experimental results on comparing the similarity measures with existing tree metrics and on evaluating the efficiency of the search techniques demonstrate the effectiveness of the search engine. 
Future work includes extending the techniques to other structural data, as well as developing new filters and algorithms for speeding up searching and mining in complex structures.
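
The first similarity measure above, counting mismatching paths between a query tree and a data tree, can be illustrated with a minimal sketch. It assumes that "paths" means root-to-leaf label paths and that the mismatch count is the size of their multiset symmetric difference; the dissertation's exact definition, its suffix-array storage, and its hashing are not reproduced here. The `(label, children)` tuple representation and all function names are hypothetical.

```python
from collections import Counter


def root_to_leaf_paths(tree):
    """Yield every root-to-leaf label path of a (label, children) tree as a tuple."""
    label, children = tree
    if not children:
        yield (label,)
        return
    for child in children:
        for path in root_to_leaf_paths(child):
            yield (label,) + path


def path_mismatch(query, data):
    """Count paths found in one tree but not the other (assumed definition)."""
    q = Counter(root_to_leaf_paths(query))
    d = Counter(root_to_leaf_paths(data))
    # Counter subtraction keeps only positive counts, so the two one-sided
    # differences together form the multiset symmetric difference.
    return sum(((q - d) + (d - q)).values())


def rank_by_mismatch(query, database):
    """Return the names of the data trees, most similar (fewest mismatches) first."""
    return sorted(database, key=lambda name: path_mismatch(query, database[name]))


# Toy trees with invented labels.
query = ("a", [("b", []), ("c", [("d", [])])])
database = {
    "same": ("a", [("b", []), ("c", [("d", [])])]),
    "near": ("a", [("b", []), ("c", [("e", [])])]),
    "far": ("x", [("y", [])]),
}
ranking = rank_by_mismatch(query, database)
```

A linear scan like `rank_by_mismatch` is only workable for small collections; the indexing and filtering techniques named in the abstract exist to avoid comparing the query against every stored tree.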

    Creation, evaluation, and use of PSI, a program for identifying protein-phenotype relationships and comparing protein content in groups of organisms

    Recent advances in DNA sequencing technology have enabled entire genomes to be sequenced quickly and accurately, resulting in an exponential increase in the number of organisms whose genome sequences have been elucidated. While the genome sequence of a given organism represents an important starting point in understanding its physiology, the functions of the protein products of many genes are still unknown; as such, computational methods for studying protein function are becoming increasingly important. In addition, this wealth of genomic information has created an unprecedented opportunity to compare the protein content of different organisms; among other applications, this can enable us to improve taxonomic classifications, to develop more accurate diagnostic tests for identifying particular bacteria, and to better understand protein content relationships in both closely-related and distantly-related organisms. This thesis describes the design, evaluation, and use of a program called Proteome Subtraction and Intersection (PSI) that uses an idea called genome subtraction for discovering protein-phenotype relationships and for characterizing differences in protein content in groups of organisms. PSI takes as input a set of proteomes, as well as a partitioning of that set into a subset of "included" proteomes and a subset of "excluded" proteomes. Using reciprocal BLAST hits, PSI finds orthologous relationships among all the proteins in the proteomes from the original set, and then finds groups of orthologous proteins containing at least one orthologue from each of the proteomes in the "included" subset, and none from any of the proteomes in the "excluded" subset. PSI is first applied to finding protein-phenotype relationships. 
By identifying proteins that are present in all sequenced isolates of the genus Lactobacillus but absent from the related bacterium Pediococcus pentosaceus, the analysis reveals proteins that are likely to be responsible for the difference in cell shape between the lactobacilli and P. pentosaceus. In addition, proteins are identified that may be responsible for resistance to the antibiotic gatifloxacin in some lactic acid bacteria. This thesis also explores the use of PSI for comparing protein content in groups of organisms. Based on the idea of genome subtraction, a novel metric is proposed for comparing the difference in protein content between two organisms. This metric is then used to create a phylogenetic tree for a large set of bacteria, which to the author's knowledge represents the largest phylogenetic tree created to date using protein content. In addition, PSI is used to find the proteomic cohesiveness of isolates of several bacterial species in order to support or refute their current taxonomic classifications. Overall, PSI is a versatile tool with many interesting applications, and should become increasingly valuable as additional genomic information becomes available.
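
The selection step described in this abstract (keep orthologue groups that have a member in every "included" proteome and none in any "excluded" proteome) can be sketched as follows. This is an illustrative sketch under stated assumptions, not PSI's actual code: the function names, the dictionary inputs, and the toy identifiers are all hypothetical, and running BLAST itself is left out.

```python
def reciprocal_best_hits(best_ab, best_ba):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a.

    best_ab maps each protein of proteome A to its best hit in proteome B;
    best_ba is the reverse mapping. Both would come from BLAST searches.
    """
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def select_groups(groups, included, excluded):
    """Pick orthologue groups matching the included/excluded partition.

    groups maps a group id to the set of proteome names that contribute
    at least one protein to that group.
    """
    included, excluded = set(included), set(excluded)
    return sorted(
        gid
        for gid, members in groups.items()
        if included <= members and not (members & excluded)
    )


# Toy data with invented identifiers.
pairs = reciprocal_best_hits(
    {"a1": "b1", "a2": "b2"},
    {"b1": "a1", "b2": "a3"},
)
groups = {
    "g1": {"lacto_1", "lacto_2"},
    "g2": {"lacto_1", "lacto_2", "pedio"},
    "g3": {"lacto_1"},
}
selected = select_groups(groups, ["lacto_1", "lacto_2"], ["pedio"])
```

In the toy data only `g1` survives: `g2` has a member in the excluded proteome and `g3` is missing from one included proteome.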

    NeXML: Rich, Extensible, and Verifiable Representation of Comparative Data and Metadata

    In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input–output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.
    R.A.V. received support from the CIPRES project (NSF #EF-03314953 to W.P.M.), the FP7 Marie Curie Programme (Call FP7-PEOPLE-IEF-2008, Proposal No. 237046) and, for the NeXML implementation in TreeBASE, the pPOD project (NSF IIS 0629846); P.E.M. and J.S. received support from CIPRES (NSF #EF-0331495, #EF-0715370); M.T.H. was supported by NSF (DEB-ATOL-0732920); X.X. received support from NSERC (Canada) Discovery and RTI grants; W.P.M. received support from an NSERC (Canada) Discovery grant; J.C. received support from a Google Summer of Code 2007 grant; A.P. received support from a Google Summer of Code 2010 grant.

    Computational Methods for Comparative Non-coding RNA Analysis: from Secondary Structures to Tertiary Structures

    Unlike messenger RNAs (mRNAs), whose information is encoded in their primary sequences, the cellular roles of non-coding RNAs (ncRNAs) originate from their structures. Therefore, studying structural conservation in ncRNAs is important for an in-depth understanding of their functions. In the past years, many computational methods have been proposed to analyze the common structural patterns in ncRNAs using comparative methods. However, RNA structural comparison is not a trivial task, and the existing approaches still have numerous issues in efficiency and accuracy. In this dissertation, we introduce a suite of novel computational tools that extend the classic models for ncRNA secondary and tertiary structure comparisons. For RNA secondary structure analysis, we first developed a computational tool, named PhyloRNAalifold, to integrate phylogenetic information into consensus structural folding. The underlying idea of this algorithm is that the importance of a co-varying mutation should be determined by its position on the phylogenetic tree. By assigning high scores to the critical covariances, the prediction of RNA secondary structure can be more accurate. Besides structure prediction, we also developed a computational tool, named ProbeAlign, to improve the efficiency of genome-wide ncRNA screening by using high-throughput RNA structural probing data. It treats the chemical reactivities embedded in the probing information as pairing attributes of the search targets. This approach avoids the time-consuming base pair matching in secondary structure alignment. The application of ProbeAlign to the FragSeq datasets shows its capability for genome-wide ncRNA analysis. For RNA tertiary structure analysis, we first developed a computational tool, named STAR3D, to find the global conservation in RNA 3D structures. STAR3D aims at finding the consensus of stacks by using 2D topology and 3D geometry together. 
Then, the loop regions can be ordered and aligned according to their relative positions in the consensus. This stack-guided alignment method brings a divide-and-conquer strategy to RNA 3D structural alignment, which improves its efficiency dramatically. Furthermore, we clustered all loop regions in non-redundant RNA 3D structures to detect plausible RNA structural motifs de novo. The computational pipeline, named RNAMSC, was extended to handle large-scale PDB datasets, and thorough downstream analysis was performed to ensure that the clustering results are valid and easy to apply in further research. The final results contain many interesting variations of known motifs, such as the GNAA tetraloop, kink-turn, sarcin-ricin and t-loop motifs. We also discovered novel functional motifs that are conserved in a wide range of ncRNAs, including ribosomal RNA, sgRNA, SRP RNA, the GlmS riboswitch and the twister ribozyme.

    A genome phylogeny for mitochondria among alpha-proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genes

    Analyses of 55 individual and 31 concatenated protein data sets encoded in Reclinomonas americana and Marchantia polymorpha mitochondrial genomes revealed that current methods for constructing phylogenetic trees are insufficiently sensitive (or artifact-insensitive) to ascertain the sister of mitochondria among the current sample of eight alpha-proteobacterial genomes using mitochondrially-encoded proteins. However, Rhodospirillum rubrum came as close to mitochondria as any alpha-proteobacterium investigated. This prompted a search for methods to directly compare eukaryotic genomes to their prokaryotic counterparts to investigate the origin of the mitochondrion and its host from the standpoint of nuclear genes. We examined pairwise amino acid sequence identity in comparisons of 6,214 nuclear protein-coding genes from Saccharomyces cerevisiae to 177,117 proteins encoded in sequenced genomes from 45 eubacteria and 15 archaebacteria. The results reveal that approximately 75% of yeast genes having homologues among the present prokaryotic sample share greater amino acid sequence identity to eubacterial than to archaebacterial homologues. At high stringency comparisons, only the eubacterial component of the yeast genome is detectable. Our findings indicate that at the levels of overall amino acid sequence identity and gene content, yeast shares a sister-group relationship with eubacteria, not with archaebacteria, in contrast to the current phylogenetic paradigm based on ribosomal RNA. Among eubacteria and archaebacteria, proteobacterial and methanogen genomes, respectively, shared more similarity with the yeast genome than other prokaryotic genomes surveyed.
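
The comparison summarized above, asking for each yeast gene whether its closest prokaryotic homologue is eubacterial or archaebacterial, can be sketched as a simple tally. This is a hedged illustration only: the function name, the input layout, and the `min_identity` stringency parameter are assumptions, and the identity values are invented toy numbers rather than data from the study.

```python
def classify_genes(best_identity, min_identity=0.0):
    """Tally genes by the prokaryotic domain holding their most similar homologue.

    best_identity maps a gene id to a pair (best eubacterial identity,
    best archaebacterial identity) as fractions in [0, 1], where 0.0 means
    that no homologue was detected in that domain. Genes with no homologue
    at all, or whose best identity falls below min_identity, are skipped.
    """
    tally = {"eubacterial": 0, "archaebacterial": 0, "tie": 0}
    for eub, arc in best_identity.values():
        best = max(eub, arc)
        if best == 0.0 or best < min_identity:
            continue
        if eub > arc:
            tally["eubacterial"] += 1
        elif arc > eub:
            tally["archaebacterial"] += 1
        else:
            tally["tie"] += 1
    return tally


# Invented toy values, not results from the paper.
toy = {
    "gene_1": (0.6, 0.3),
    "gene_2": (0.2, 0.5),
    "gene_3": (0.4, 0.0),
    "gene_4": (0.0, 0.0),
}
relaxed = classify_genes(toy)
strict = classify_genes(toy, min_identity=0.55)
```

Raising `min_identity` mimics a higher-stringency comparison, in which fewer genes pass the filter and the tally reflects only the strongest matches.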