
    Calculating Orthologs in Bacteria and Archaea: A Divide and Conquer Approach

    Among proteins, orthologs are defined as those that are derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching, which is prohibitively expensive in terms of hardware requirements or computational time (requiring an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a "divide and conquer" approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at the species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level as it progressively climbs the taxonomic tree, reusing the previously computed data. This leads to a significant decrease in the number of alignments that need to be performed, which translates into faster computation and makes ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes with more than 4 million coding sequences in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and in so doing have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows for scalable and efficient computation of bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations.
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/
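The hierarchical scheme described above can be sketched as follows. This is a toy illustration, not the xBASE-Orth implementation: the `similar` callback stands in for a real alignment-based similarity test (e.g. a BLAST hit above some threshold), and the greedy single-linkage clustering is a deliberately simplistic proxy for the actual pan-genome construction. The point it demonstrates is structural: clustering within each species first and then comparing only one representative per cluster avoids an all-against-all comparison across every genome.

```python
def cluster(seqs, similar):
    """Greedy single-linkage clustering: each sequence joins the first
    cluster whose representative it resembles, else starts a new one."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if similar(c[0], s):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def pan_genome(clusters):
    """One representative per cluster serves as a proxy for the level."""
    return [c[0] for c in clusters]

def hierarchical_orthologs(genomes_by_species, similar):
    """Cluster within each species, then cluster only the species-level
    representatives at the next taxonomic level up."""
    proxies = []
    for species, seqs in genomes_by_species.items():
        proxies.extend(pan_genome(cluster(seqs, similar)))
    return cluster(proxies, similar)
```

With a trivial similarity test, sequences sharing a representative within each species are collapsed before the cross-species step, so the number of pairwise comparisons grows with the number of clusters rather than the number of sequences.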

    Unfolding RNA 3D structures for secondary structure prediction benchmarking

    Ribonucleic acids (RNA) adopt complex three-dimensional structures that are stabilized by the formation of base pairs, also known as the secondary (2D) structure. Predicting where and how many of these interactions occur has been the focus of many computational methods called 2D structure prediction algorithms. To simplify computation, these methods generally restrict the types of base pairs and the topology of the predicted 2D structures, which makes it difficult to know how well such a restricted 2D structure can represent the full set of base pairs, especially when large numbers of base pairs are ignored. MC-Unfold was created to remove interactions that violate the assumptions commonly used by prediction methods. This process, named unfolding, extends previous planarization and pseudoknot-removal methods. To evaluate how well computational methods predict experimental structures, a set of 321 RNA monomers corresponding to more than 4,223 experimental structures was assembled. These structures were mostly determined by nuclear magnetic resonance and X-ray crystallography. MC-Unfold was used to remove the interactions the prediction algorithms were not expected to predict, and the unfolded structures were then compared with the predicted structures. MC-Unfold performed well on this test set: in less than five minutes, 96% of the 227 structures could be exhaustively unfolded, the few remaining structures being too large to unfold in reasonable time. MC-Unfold is therefore a practical alternative to current methods. As for the evaluation of prediction methods, the results indicate that computational methods do recover experimental structures with some success, especially for small molecules. However, for large or pseudoknotted molecules, the results are generally unfavorable. As a consequence, 2D structure prediction methods should be used with caution, especially for large structures.
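Unfolding, in its simplest form, reduces a set of base pairs to a nested subset that restricted 2D representations can express. The sketch below shows that core idea only; it is not MC-Unfold's actual algorithm, which handles more interaction types and searches for better subsets than this 5'-first greedy heuristic.

```python
def crosses(p, q):
    """Two base pairs (i, j) and (k, l) form a pseudoknot if they
    interleave: i < k < j < l or k < i < l < j."""
    (i, j), (k, l) = p, q
    return (i < k < j < l) or (k < i < l < j)

def remove_pseudoknots(pairs):
    """Greedily keep base pairs that do not cross any already-kept pair,
    yielding a nested (pseudoknot-free) secondary structure."""
    kept = []
    for p in sorted(pairs):  # favour 5'-most pairs: a stand-in heuristic
        if all(not crosses(p, q) for q in kept):
            kept.append(p)
    return kept
```

For example, with pairs (1, 10), (2, 9) and (5, 15), the third pair interleaves with the first and is dropped, leaving a nested structure.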

    Exploration of Reaction Pathways and Chemical Transformation Networks

    For the investigation of chemical reaction networks, the identification of all relevant intermediates and elementary reactions is mandatory. Many algorithmic approaches exist that perform such explorations efficiently and in an automated fashion. These approaches differ in their range of applicability, the completeness of the exploration, and the amount of heuristics and human intervention required. Here, we describe and compare the different approaches based on these criteria. Future directions leveraging the strengths of chemical heuristics, human interaction, and physical rigor are discussed. Comment: 48 pages, 4 figures

    A divide-and-conquer approach to analyze underdetermined biochemical models

    Motivation: To obtain meaningful predictions from dynamic computational models, their uncertain parameter values need to be estimated from experimental data. Because the number of parameters is usually large compared to the available measurement data, these estimation problems are often underdetermined, meaning that the solution is a multidimensional space. The challenge is then to obtain a sound system understanding despite non-identifiable parameter values, e.g. by identifying those parameters that most sensitively determine the model's behavior. Results: Here, we present the so-called divide-and-conquer approach, a strategy to analyze underdetermined biochemical models. The approach draws on steady-state omics measurement data and exploits a decomposition of the global estimation problem into independent subproblems. The solutions to these subproblems are joined to form the complete space of global optima, which can then be easily analyzed. We derive the conditions under which the decomposition occurs, outline strategies to fulfill these conditions and, using an example model, illustrate how the approach uncovers the most important parameters and suggests targeted experiments without knowing the exact parameter values. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
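The decomposition idea can be illustrated with a toy problem in which the objective separates into two independent parameter blocks. This is a hypothetical example, not the paper's model; real models must satisfy the decomposition conditions the paper derives. Each subproblem is optimized on its own grid, and joining the subproblem solution sets (here via a Cartesian product) yields the full space of global optima.

```python
from itertools import product

# Toy residuals that depend on disjoint parameter blocks (hypothetical).
def residual_1(a):
    return (a - 2.0) ** 2

def residual_2(b):
    return (b - 5.0) ** 2

def optima(residual, grid, tol=1e-9):
    """All grid points whose residual is within `tol` of the minimum."""
    best = min(residual(x) for x in grid)
    return [x for x in grid if residual(x) - best < tol]

grid = [i * 0.5 for i in range(21)]        # parameter values 0.0 .. 10.0
space1 = optima(residual_1, grid)          # global optima of subproblem 1
space2 = optima(residual_2, grid)          # global optima of subproblem 2
# Joining the subproblem solutions gives the global optimum space.
global_optima = list(product(space1, space2))
```

If a subproblem were non-identifiable (a flat residual over several grid points), its solution set would contain multiple values, and the joined space would expose exactly that degeneracy.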

    Evaluation of protein surface roughness index using its heat denatured aggregates

    Recent research on the potential of different surface-describing parameters to predict protein surface properties has gained significance for its possible implications in extracting clues about a protein's functional sites. In this direction, the Surface Roughness Index (SRI), a surface topological parameter, has shown its potential to predict the SCOP family of a protein. The present work builds on that foundation: a semi-empirical method for evaluating the Surface Roughness Index directly from heat-denatured protein aggregates (HDPA) was designed and demonstrated successfully. The steps consist of extracting a feature, the Intensity Level Multifractal Dimension (ILMFD), from microscopic images of HDPA, followed by mapping the ILMFD to the SRI through a recurrent backpropagation network (RBPN). Finally, the SRI for a particular protein was predicted by clustering the decisions obtained from feeding multiple data points into the RBPN, in order to capture the general tendency of the decisions as well as to discard noisy data. The centre of the largest cluster was found to be the best match for the Surface Roughness Index of each protein in our study. The semi-empirical approach adopted in this paper shows a way to evaluate a protein's surface properties without depending on its previously determined structure.
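As a rough illustration of the image-feature step, the sketch below estimates a simple box-counting fractal dimension of a binary image. The actual ILMFD is a multifractal, intensity-level feature, so this single-exponent estimate is only a simplified stand-in for the kind of quantity extracted from the aggregate images.

```python
import math

def box_counting_dimension(img, sizes=(1, 2, 4, 8)):
    """Estimate the fractal dimension of a square binary image: count,
    at each box size s, how many s-by-s boxes contain a foreground
    pixel, then fit log(count) against log(1/s) by least squares."""
    n = len(img)
    xs, ys = [], []
    for s in sizes:
        count = 0
        for bi in range(0, n, s):
            for bj in range(0, n, s):
                if any(img[i][j]
                       for i in range(bi, min(bi + s, n))
                       for j in range(bj, min(bj + s, n))):
                    count += 1
        xs.append(math.log(1.0 / s))
        ys.append(math.log(count))
    # Least-squares slope of the log-log fit is the estimated dimension.
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A completely filled image yields a dimension of 2, the value expected for a space-filling planar set; rougher boundaries give intermediate values.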

    A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure

    BACKGROUND: Covariance models (CMs) are probabilistic models of RNA secondary structure, analogous to profile hidden Markov models of linear sequence. The dynamic programming algorithm for aligning a CM to an RNA sequence of length N is O(N^3) in memory. This is only practical for small RNAs. RESULTS: I describe a divide and conquer variant of the alignment algorithm that is analogous to memory-efficient Myers/Miller dynamic programming algorithms for linear sequence alignment. The new algorithm has an O(N^2 log N) memory complexity, at the expense of a small constant factor in time. CONCLUSIONS: Optimal ribosomal RNA structural alignments that previously required up to 150 GB of memory now require less than 270 MB.
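The linear-sequence analogue on which the CM variant is modeled can be sketched with Hirschberg-style divide and conquer, shown here for the longest common subsequence: the full dynamic programming table is never stored, only single rows, and the alignment is recovered by recursively splitting at an optimal crossing point. This is the textbook technique, not the CM algorithm itself.

```python
def lcs_lengths(a, b):
    """Last row of the LCS length table, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev

def hirschberg(a, b):
    """Divide-and-conquer LCS in linear memory: split `a` in half,
    find the split point of `b` that maximizes the combined score
    of the two halves, then recurse on each half."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    left = lcs_lengths(a[:mid], b)
    right = lcs_lengths(a[mid:][::-1], b[::-1])
    k = max(range(len(b) + 1),
            key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:k]) + hirschberg(a[mid:], b[k:])
```

The trade-off mirrors the abstract: memory drops from quadratic to linear while time gains only a small constant factor, because each level of recursion re-scores half-sized subproblems.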

    Scaling up classification rule induction through parallel processing

    The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction
    • 

    corecore