2,969 research outputs found
Calculating Orthologs in Bacteria and Archaea: A Divide and Conquer Approach
Among proteins, orthologs are defined as those derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching, which is prohibitively expensive in hardware requirements or computational time (an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a "divide and conquer" approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at the species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level, reusing previously computed data as it progressively climbs the taxonomic tree. This significantly decreases the number of alignments that need to be performed, which translates into faster computation and makes ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes, comprising more than 4 million coding sequences, in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and, in so doing, have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows scalable and efficient computation of bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations.
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/
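The hierarchical scheme described in the abstract (cluster genes within each genome, then compare only pan-genome representatives while climbing the taxonomic tree) can be sketched roughly as follows. This is a toy illustration, not xBASE-Orth's actual pipeline: the `similar` predicate stands in for BLAST-based sequence comparison, and simple greedy single-linkage clustering stands in for the tool's real clustering method.

```python
def build_pan_genome(clusters):
    """Collapse each cluster of genes to one representative (a pan-genome proxy)."""
    return [cluster[0] for cluster in clusters]

def cluster_by_similarity(genes, similar):
    """Greedy single-linkage clustering using a pairwise `similar` predicate."""
    clusters = []
    for g in genes:
        for c in clusters:
            if any(similar(g, m) for m in c):
                c.append(g)
                break
        else:
            clusters.append([g])
    return clusters

def divide_and_conquer_orthologs(taxon, similar):
    """Recursively cluster genes up a taxonomy tree.

    `taxon` is either a list of genes (a genome, at the leaves) or a list of
    child taxa.  At each internal node only the representatives of the
    children's clusters are compared, never every gene against every gene,
    which is where the reduction in alignment count comes from.
    """
    if taxon and not isinstance(taxon[0], list):      # leaf: a single genome
        return cluster_by_similarity(taxon, similar)
    child_clusters = []
    for child in taxon:
        child_clusters.extend(divide_and_conquer_orthologs(child, similar))
    reps = build_pan_genome(child_clusters)           # pan-genome proxy
    rep_clusters = cluster_by_similarity(reps, similar)
    # Re-attach the full membership behind each representative.
    by_rep = {c[0]: c for c in child_clusters}
    return [[g for r in rc for g in by_rep[r]] for rc in rep_clusters]
```

With a toy predicate that treats genes sharing a name prefix as homologous, two genomes `["atpA_1", "rpoB_1"]` and `["atpA_2", "rpoB_2"]` yield the two expected orthologous groups.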
Unfolding RNA 3D structures for secondary structure prediction benchmarking
Ribonucleic acids (RNA) adopt complex three dimensional structures which are stabilized by the formation of base pairs, also known as the secondary (2D) structure. Predicting where and how many of these interactions occur has been the focus of many computational methods called 2D structure prediction algorithms. These methods disregard
some interactions, which makes it difficult to know how well a 2D structure represents
an RNA structure, especially when large amounts of base pairs are ignored.
MC-Unfold was created to remove interactions violating the assumptions used by prediction methods. This process, named unfolding, extends previous planarization and
pseudoknot removal methods. To evaluate how well computational methods can predict
experimental structures, a set of 321 RNA monomers corresponding to more than 4223
experimental structures was acquired. These structures were mostly determined using
nuclear magnetic resonance and X-ray crystallography. MC-Unfold was used to remove
interactions the prediction algorithms were not expected to predict. These structures
were then compared with the structures predicted.
MC-Unfold performed very well on the test set it was given. In less than five minutes,
96% of the 227 structures could be exhaustively unfolded. The few remaining structures
are very large and could not be unfolded in reasonable time. MC-Unfold is therefore a
practical alternative to the current methods.
As for the evaluation of prediction methods, MC-Unfold demonstrated that the computational methods do find experimental structures, especially for small molecules. However,
when considering large or pseudoknotted molecules, the results are not so encouraging.
As a consequence, 2D structure prediction methods should be used with caution, especially for large structures.
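The core difficulty the abstract describes (restricted 2D structures cannot contain crossing base pairs) can be illustrated with a generic maximum-nested-subset computation. This is a standard interval dynamic program, not MC-Unfold's actual algorithm, which extends planarization and pseudoknot-removal methods in ways not detailed here:

```python
from functools import lru_cache

def remove_pseudoknots(pairs, n):
    """Return a largest crossing-free (nested) subset of base pairs.

    `pairs` is a set of (i, j) tuples with i < j over positions 0..n-1.
    Two pairs (i, j) and (k, l) with i < k < j < l cross, so a restricted
    2D structure can keep at most one of them; "unfolding" removes the rest.
    """
    partners = {}
    for i, j in pairs:
        partners.setdefault(i, []).append(j)

    @lru_cache(maxsize=None)
    def best(lo, hi):
        if lo >= hi:
            return 0, frozenset()
        # Option 1: position `lo` stays unpaired.
        score, kept = best(lo + 1, hi)
        # Option 2: keep a pair (lo, k) and solve the two sub-intervals,
        # which is what forces the result to be nested.
        for k in partners.get(lo, []):
            if k <= hi:
                s1, k1 = best(lo + 1, k - 1)
                s2, k2 = best(k + 1, hi)
                if 1 + s1 + s2 > score:
                    score = 1 + s1 + s2
                    kept = k1 | k2 | {(lo, k)}
        return score, kept

    return best(0, n - 1)[1]
```

For the crossing pairs (0, 5) and (1, 6) only one survives, while a fully nested helix such as (0, 7), (1, 6), (2, 5) is kept intact.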
Exploration of Reaction Pathways and Chemical Transformation Networks
For the investigation of chemical reaction networks, the identification of
all relevant intermediates and elementary reactions is mandatory. Many
algorithmic approaches exist that perform explorations efficiently and
automatedly. These approaches differ in their application range, the level of
completeness of the exploration, as well as the amount of heuristics and human
intervention required. Here, we describe and compare the different approaches
based on these criteria. Future directions leveraging the strengths of chemical
heuristics, human interaction, and physical rigor are discussed.
A divide-and-conquer approach to analyze underdetermined biochemical models
Motivation: To obtain meaningful predictions from dynamic computational models, their uncertain parameter values need to be estimated from experimental data. Because the number of parameters is usually large compared to the available measurement data, these estimation problems are often underdetermined, meaning that the solution is a multidimensional space. In this case, the challenge is to obtain a sound system understanding despite non-identifiable parameter values, e.g. by identifying those parameters that most sensitively determine the model's behavior. Results: Here, we present the so-called divide-and-conquer approach, a strategy to analyze underdetermined biochemical models. The approach draws on steady-state omics measurement data and exploits a decomposition of the global estimation problem into independent subproblems. The solutions to these subproblems are joined to form the complete space of global optima, which can then be easily analyzed. We derive the conditions under which the decomposition occurs, outline strategies to fulfill these conditions and, using an example model, illustrate how the approach uncovers the most important parameters and suggests targeted experiments without knowing the exact parameter values. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
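The decomposition idea can be made concrete with a minimal sketch: when the residual terms of a fit depend on disjoint parameter subsets, each subproblem can be minimized independently and the full space of global optima is the Cartesian product of the local optima sets. The grid-search solver and the toy residuals below are illustrative assumptions, not the paper's method, which derives general conditions for when such a decomposition holds:

```python
import itertools

def solve_subproblem(residual, grid):
    """Grid-search one independent subproblem; return ALL optimal points,
    since an underdetermined problem may have many."""
    scores = {p: residual(p) for p in grid}
    best = min(scores.values())
    return [p for p, s in scores.items() if s == best], best

def divide_and_conquer_fit(subproblems):
    """Join per-subproblem optima into the global optimum space.

    `subproblems` is a list of (residual, grid) pairs whose residuals
    depend on disjoint parameter subsets, so the global sum of residuals
    is minimized by minimizing each term on its own.
    """
    local = [solve_subproblem(r, g) for r, g in subproblems]
    total = sum(score for _, score in local)
    return list(itertools.product(*[opts for opts, _ in local])), total
```

A residual with two distinct minimizers in the second subproblem produces a two-point space of global optima, the kind of non-identifiability the abstract refers to.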
Evaluation of protein surface roughness index using its heat denatured aggregates
Recent research on the potential of different parameters describing protein surfaces to predict surface properties has gained significance for its possible implications in extracting clues about a protein's functional site. In this direction, Surface Roughness Index, a surface topological parameter, showed its potential to predict the SCOP family of a protein. The present work builds on that foundation: a semi-empirical method for evaluating Surface Roughness Index directly from heat-denatured protein aggregates (HDPA) was designed and demonstrated successfully. The steps followed consist of the extraction of a feature, Intensity Level Multifractal Dimension (ILMFD), from microscopic images of HDPA, followed by the mapping of ILMFD onto Surface Roughness Index (SRI) through a recurrent backpropagation network (RBPN). Finally, the SRI for a particular protein was predicted by clustering the decisions obtained from feeding multiple data points into the RBPN, both to obtain the general tendency of the decisions and to discard noisy datasets. The centre of the largest cluster was found to be the best match for the Surface Roughness Index of each protein in our study. The semi-empirical approach adopted in this paper shows a way to evaluate a protein's surface properties without depending on its already-solved structure.
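The final decision-clustering step (feed many samples through the network, cluster the outputs, and report the centre of the largest cluster to suppress noisy samples) can be sketched in one dimension. This is a simplified stand-in; the actual system clusters RBPN outputs computed from many aggregate images per protein, and the clustering method and tolerance below are assumptions for illustration:

```python
def robust_prediction(predictions, tol=0.5):
    """Cluster repeated (noisy) network outputs by proximity and return
    the centre of the largest cluster, discarding outlier decisions."""
    clusters = []
    for p in sorted(predictions):
        # Start a new cluster whenever the gap to the previous value
        # exceeds the tolerance; otherwise extend the current cluster.
        if clusters and p - clusters[-1][-1] <= tol:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    biggest = max(clusters, key=len)
    return sum(biggest) / len(biggest)
```

Given outputs 2.0, 2.1, 2.2 and an outlier 9.0, the outlier forms its own small cluster and the reported value is the centre of the dominant one, about 2.1.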
A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure
BACKGROUND: Covariance models (CMs) are probabilistic models of RNA secondary structure, analogous to profile hidden Markov models of linear sequence. The dynamic programming algorithm for aligning a CM to an RNA sequence of length N is O(N^3) in memory. This is only practical for small RNAs. RESULTS: I describe a divide and conquer variant of the alignment algorithm that is analogous to memory-efficient Myers/Miller dynamic programming algorithms for linear sequence alignment. The new algorithm has an O(N^2 log N) memory complexity, at the expense of a small constant factor in time. CONCLUSIONS: Optimal ribosomal RNA structural alignments that previously required up to 150 GB of memory now require less than 270 MB.
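The Myers/Miller technique the abstract builds on is itself an application of Hirschberg's divide and conquer for linear-sequence alignment: keep only one DP row, split one sequence in half, find the optimal split point of the other by combining forward and reverse score rows, and recurse. The sketch below shows that linear-space analogy on the longest common subsequence problem; it is not the CM alignment algorithm itself:

```python
def lcs_row(a, b):
    """Last row of the LCS-length DP table between a and b,
    computed in O(len(b)) memory instead of a full table."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev

def hirschberg(a, b):
    """An LCS of a and b in linear memory: split a at its midpoint,
    pick the split of b that maximizes forward + reverse scores,
    then solve the two halves recursively."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    left = lcs_row(a[:mid], b)
    right = lcs_row(a[mid:][::-1], b[::-1])
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:split]) + hirschberg(a[mid:], b[split:])
```

The trade-off mirrors the abstract's: memory drops from quadratic to linear (for CMs, from O(N^3) to O(N^2 log N)) while the divide and conquer recursion costs a small constant factor in time.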
Scaling up classification rule induction through parallel processing
The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.
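A common pattern in the parallelization this survey covers is data parallelism: the training rows are partitioned across workers, each worker counts how a candidate rule covers its partition, and the partial counts are merged. The rule representation and worker pool below are illustrative assumptions, not taken from any specific surveyed system:

```python
from concurrent.futures import ThreadPoolExecutor

def coverage(rule, rows):
    """Count how many rows a candidate rule covers, and how many of
    those it classifies correctly."""
    covered = [r for r in rows if all(r[a] == v for a, v in rule["if"].items())]
    correct = sum(1 for r in covered if r["class"] == rule["then"])
    return len(covered), correct

def parallel_coverage(rule, rows, workers=4):
    """Data-parallel rule evaluation: split the rows into chunks,
    count coverage of each chunk concurrently, merge the counts."""
    chunks = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: coverage(rule, c), chunks))
    return tuple(map(sum, zip(*parts)))
```

Because the partial counts are simple sums, the merge step is trivial, which is what makes coverage counting such a natural candidate for parallel rule induction.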
- …