2,969 research outputs found
Calculating Orthologs in Bacteria and Archaea: A Divide and Conquer Approach
Among proteins, orthologs are defined as those derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching, which is prohibitively expensive in hardware requirements or computational time (an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a "divide and conquer" approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at the species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level, reusing previously computed data as it progressively climbs the taxonomic tree. This significantly decreases the number of alignments that need to be performed, which translates into faster computation and makes ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes, comprising more than 4 million coding sequences, in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and, in so doing, have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows scalable and efficient computation of bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations.
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/
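The hierarchical scheme described in the abstract (cluster genes within each genome, then compare only pan-genome representatives while climbing the taxonomic tree) can be sketched roughly as follows. This is a toy illustration, not xBASE-Orth's actual pipeline: the `similar` predicate stands in for BLAST-based sequence comparison, and simple greedy single-linkage clustering stands in for the tool's real clustering method.

```python
def build_pan_genome(clusters):
    """Collapse each cluster of genes to one representative (a pan-genome proxy)."""
    return [cluster[0] for cluster in clusters]

def cluster_by_similarity(genes, similar):
    """Greedy single-linkage clustering using a pairwise `similar` predicate."""
    clusters = []
    for g in genes:
        for c in clusters:
            if any(similar(g, m) for m in c):
                c.append(g)
                break
        else:
            clusters.append([g])
    return clusters

def divide_and_conquer_orthologs(taxon, similar):
    """Recursively cluster genes up a taxonomy tree.

    `taxon` is either a list of genes (a genome, at the leaves) or a list of
    child taxa.  At each internal node only the representatives of the
    children's clusters are compared, never every gene against every gene,
    which is where the reduction in alignment count comes from.
    """
    if taxon and not isinstance(taxon[0], list):      # leaf: a single genome
        return cluster_by_similarity(taxon, similar)
    child_clusters = []
    for child in taxon:
        child_clusters.extend(divide_and_conquer_orthologs(child, similar))
    reps = build_pan_genome(child_clusters)           # pan-genome proxy
    rep_clusters = cluster_by_similarity(reps, similar)
    # Re-attach the full membership behind each representative.
    by_rep = {c[0]: c for c in child_clusters}
    return [[g for r in rc for g in by_rep[r]] for rc in rep_clusters]
```

With a toy predicate that treats genes sharing a name prefix as homologous, two genomes `["atpA_1", "rpoB_1"]` and `["atpA_2", "rpoB_2"]` yield the two expected orthologous groups.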
Unfolding RNA 3D structures for secondary structure prediction benchmarking
Ribonucleic acids (RNA) adopt complex three dimensional structures which are stabilized by the formation of base pairs, also known as the secondary (2D) structure. Predicting where and how many of these interactions occur has been the focus of many computational methods called 2D structure prediction algorithms. These methods disregard
some interactions, which makes it difficult to know how well a 2D structure represents
an RNA structure, especially when large amounts of base pairs are ignored.
MC-Unfold was created to remove interactions violating the assumptions used by prediction methods. This process, named unfolding, extends previous planarization and
pseudoknot removal methods. To evaluate how well computational methods can predict
experimental structures, a set of 321 RNA monomers corresponding to more than 4223
experimental structures was acquired. These structures were mostly determined using
nuclear magnetic resonance and X-ray crystallography. MC-Unfold was used to remove
interactions the prediction algorithms were not expected to predict. These structures
were then compared with the structures predicted.
MC-Unfold performed very well on the test set it was given. In less than five minutes,
96% of the 227 structures could be exhaustively unfolded. The few remaining structures
are very large and could not be unfolded in reasonable time. MC-Unfold is therefore a
practical alternative to the current methods.
As for the evaluation of prediction methods, MC-Unfold demonstrated that the computational methods do find experimental structures, especially for small molecules. However,
when considering large or pseudoknotted molecules, the results are not so encouraging.
As a consequence, 2D structure prediction methods should be used with caution, especially for large structures.
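The core difficulty the abstract describes (restricted 2D structures cannot contain crossing base pairs) can be illustrated with a generic maximum-nested-subset computation. This is a standard interval dynamic program, not MC-Unfold's actual algorithm, which extends planarization and pseudoknot-removal methods in ways not detailed here:

```python
from functools import lru_cache

def remove_pseudoknots(pairs, n):
    """Return a largest crossing-free (nested) subset of base pairs.

    `pairs` is a set of (i, j) tuples with i < j over positions 0..n-1.
    Two pairs (i, j) and (k, l) with i < k < j < l cross, so a restricted
    2D structure can keep at most one of them; "unfolding" removes the rest.
    """
    partners = {}
    for i, j in pairs:
        partners.setdefault(i, []).append(j)

    @lru_cache(maxsize=None)
    def best(lo, hi):
        if lo >= hi:
            return 0, frozenset()
        # Option 1: position `lo` stays unpaired.
        score, kept = best(lo + 1, hi)
        # Option 2: keep a pair (lo, k) and solve the two sub-intervals,
        # which is what forces the result to be nested.
        for k in partners.get(lo, []):
            if k <= hi:
                s1, k1 = best(lo + 1, k - 1)
                s2, k2 = best(k + 1, hi)
                if 1 + s1 + s2 > score:
                    score = 1 + s1 + s2
                    kept = k1 | k2 | {(lo, k)}
        return score, kept

    return best(0, n - 1)[1]
```

For the crossing pairs (0, 5) and (1, 6) only one survives, while a fully nested helix such as (0, 7), (1, 6), (2, 5) is kept intact.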
Exploration of Reaction Pathways and Chemical Transformation Networks
For the investigation of chemical reaction networks, the identification of
all relevant intermediates and elementary reactions is mandatory. Many
algorithmic approaches exist that perform explorations efficiently and
automatedly. These approaches differ in their application range, the level of
completeness of the exploration, as well as the amount of heuristics and human
intervention required. Here, we describe and compare the different approaches
based on these criteria. Future directions leveraging the strengths of chemical
heuristics, human interaction, and physical rigor are discussed.
A divide-and-conquer approach to analyze underdetermined biochemical models
Motivation: To obtain meaningful predictions from dynamic computational models, their uncertain parameter values need to be estimated from experimental data. Because the number of parameters is usually large compared to the available measurement data, these estimation problems are often underdetermined, meaning that the solution is a multidimensional space. In this case, the challenge is to obtain a sound system understanding despite non-identifiable parameter values, e.g. by identifying those parameters that most sensitively determine the model's behavior. Results: Here, we present the so-called divide-and-conquer approach, a strategy to analyze underdetermined biochemical models. The approach draws on steady-state omics measurement data and exploits a decomposition of the global estimation problem into independent subproblems. The solutions to these subproblems are joined to form the complete space of global optima, which can then be easily analyzed. We derive the conditions under which the decomposition occurs, outline strategies to fulfill these conditions and, using an example model, illustrate how the approach uncovers the most important parameters and suggests targeted experiments without knowing the exact parameter values. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
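The decomposition idea can be made concrete with a minimal sketch: when the residual terms of a fit depend on disjoint parameter subsets, each subproblem can be minimized independently and the full space of global optima is the Cartesian product of the local optima sets. The grid-search solver and the toy residuals below are illustrative assumptions, not the paper's method, which derives general conditions for when such a decomposition holds:

```python
import itertools

def solve_subproblem(residual, grid):
    """Grid-search one independent subproblem; return ALL optimal points,
    since an underdetermined problem may have many."""
    scores = {p: residual(p) for p in grid}
    best = min(scores.values())
    return [p for p, s in scores.items() if s == best], best

def divide_and_conquer_fit(subproblems):
    """Join per-subproblem optima into the global optimum space.

    `subproblems` is a list of (residual, grid) pairs whose residuals
    depend on disjoint parameter subsets, so the global sum of residuals
    is minimized by minimizing each term on its own.
    """
    local = [solve_subproblem(r, g) for r, g in subproblems]
    total = sum(score for _, score in local)
    return list(itertools.product(*[opts for opts, _ in local])), total
```

A residual with two distinct minimizers in the second subproblem produces a two-point space of global optima, the kind of non-identifiability the abstract refers to.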
Evaluation of protein surface roughness index using its heat denatured aggregates
Recent research on the potential of different parameters describing protein surfaces to predict surface properties has gained significance for its possible implications in extracting clues about a protein's functional site. In this direction, Surface Roughness Index, a surface topological parameter, showed its potential to predict the SCOP family of a protein. The present work builds on that foundation: a semi-empirical method for evaluating Surface Roughness Index directly from heat-denatured protein aggregates (HDPA) was designed and demonstrated successfully. The steps followed consist of the extraction of a feature, Intensity Level Multifractal Dimension (ILMFD), from microscopic images of HDPA, followed by the mapping of ILMFD onto Surface Roughness Index (SRI) through a recurrent backpropagation network (RBPN). Finally, the SRI for a particular protein was predicted by clustering the decisions obtained from feeding multiple data points into the RBPN, both to obtain the general tendency of the decisions and to discard noisy datasets. The centre of the largest cluster was found to be the best match for the Surface Roughness Index of each protein in our study. The semi-empirical approach adopted in this paper shows a way to evaluate a protein's surface properties without depending on its already-solved structure.
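The final decision-clustering step (feed many samples through the network, cluster the outputs, and report the centre of the largest cluster to suppress noisy samples) can be sketched in one dimension. This is a simplified stand-in; the actual system clusters RBPN outputs computed from many aggregate images per protein, and the clustering method and tolerance below are assumptions for illustration:

```python
def robust_prediction(predictions, tol=0.5):
    """Cluster repeated (noisy) network outputs by proximity and return
    the centre of the largest cluster, discarding outlier decisions."""
    clusters = []
    for p in sorted(predictions):
        # Start a new cluster whenever the gap to the previous value
        # exceeds the tolerance; otherwise extend the current cluster.
        if clusters and p - clusters[-1][-1] <= tol:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    biggest = max(clusters, key=len)
    return sum(biggest) / len(biggest)
```

Given outputs 2.0, 2.1, 2.2 and an outlier 9.0, the outlier forms its own small cluster and the reported value is the centre of the dominant one, about 2.1.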
A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure
BACKGROUND: Covariance models (CMs) are probabilistic models of RNA secondary structure, analogous to profile hidden Markov models of linear sequence. The dynamic programming algorithm for aligning a CM to an RNA sequence of length N is O(N^3) in memory. This is only practical for small RNAs. RESULTS: I describe a divide and conquer variant of the alignment algorithm that is analogous to memory-efficient Myers/Miller dynamic programming algorithms for linear sequence alignment. The new algorithm has an O(N^2 log N) memory complexity, at the expense of a small constant factor in time. CONCLUSIONS: Optimal ribosomal RNA structural alignments that previously required up to 150 GB of memory now require less than 270 MB.
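The Myers/Miller technique the abstract builds on is itself an application of Hirschberg's divide and conquer for linear-sequence alignment: keep only one DP row, split one sequence in half, find the optimal split point of the other by combining forward and reverse score rows, and recurse. The sketch below shows that linear-space analogy on the longest common subsequence problem; it is not the CM alignment algorithm itself:

```python
def lcs_row(a, b):
    """Last row of the LCS-length DP table between a and b,
    computed in O(len(b)) memory instead of a full table."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev

def hirschberg(a, b):
    """An LCS of a and b in linear memory: split a at its midpoint,
    pick the split of b that maximizes forward + reverse scores,
    then solve the two halves recursively."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    left = lcs_row(a[:mid], b)
    right = lcs_row(a[mid:][::-1], b[::-1])
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:split]) + hirschberg(a[mid:], b[split:])
```

The trade-off mirrors the abstract's: memory drops from quadratic to linear (for CMs, from O(N^3) to O(N^2 log N)) while the divide and conquer recursion costs a small constant factor in time.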
Scaling up classification rule induction through parallel processing
The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.
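A common pattern in the parallelization this survey covers is data parallelism: the training rows are partitioned across workers, each worker counts how a candidate rule covers its partition, and the partial counts are merged. The rule representation and worker pool below are illustrative assumptions, not taken from any specific surveyed system:

```python
from concurrent.futures import ThreadPoolExecutor

def coverage(rule, rows):
    """Count how many rows a candidate rule covers, and how many of
    those it classifies correctly."""
    covered = [r for r in rows if all(r[a] == v for a, v in rule["if"].items())]
    correct = sum(1 for r in covered if r["class"] == rule["then"])
    return len(covered), correct

def parallel_coverage(rule, rows, workers=4):
    """Data-parallel rule evaluation: split the rows into chunks,
    count coverage of each chunk concurrently, merge the counts."""
    chunks = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: coverage(rule, c), chunks))
    return tuple(map(sum, zip(*parts)))
```

Because the partial counts are simple sums, the merge step is trivial, which is what makes coverage counting such a natural candidate for parallel rule induction.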
- …