92,036 research outputs found
Massively Parallel Algorithm for Multiple Sequence Alignment Based on Artificial Bee Colony
In silico biological sequence processing is a key task in molecular biology. This scientific area requires powerful computing
resources for exploring large sets of biological data. Parallel in silico simulations based on methods and algorithms for analysis of
biological data using high-performance distributed computing is essential for accelerating the research and reducing the investment.
Multiple sequence alignment is a widely used method for biological sequence processing. The goal of this method is DNA and protein sequences alignment. This paper presents an innovative parallel algorithm MSA_BG for multiple alignment of biological sequences that is highly scalable and locality aware. The MSA_BG algorithm we describe is iterative and is based on the concept of Artificial Bee Colony metaheuristics and the concept of algorithmic and architectural spaces correlation. The metaphor of the ABC
metaheuristics has been constructed and the functionalities of the agents have been defined. The conceptual parallel model of
computation has been designed and the algorithmic framework of the designed parallel algorithm constructed. Experimental
simulations on the basis of parallel implementation of MSA_BG algorithm for multiple sequences alignment on heterogeneouc
compact computer cluster and supercomputer BlueGene/P have been carried out for the case study of the influenza virus variability
investigation. The performance estimation and profiling analyses have shown that the parallel system is well balanced both in
respect to the workload and machine size
Formal executable descriptions of biological systems
The similarities between systems of living entities and systems of concurrent processes may support biological experiments in silico. Process calculi offer a formal framework to describe biological systems, as well as to analyse their behaviour, both from a qualitative and a quantitative point of view. A couple of little examples help us in showing how this can be done. We mainly focus our attention on the qualitative and quantitative aspects of the considered biological systems, and briefly illustrate which kinds of analysis are possible. We use a known stochastic calculus for the first example. We then present some statistics collected by repeatedly running the specification, that turn out to agree with those obtained by experiments in vivo. Our second example motivates a richer calculus. Its stochastic extension requires a non trivial machinery to faithfully reflect the real dynamic behaviour of biological systems
Regulatory motif discovery using a population clustering evolutionary algorithm
This paper describes a novel evolutionary algorithm for regulatory motif discovery in DNA promoter sequences. The algorithm uses data clustering to logically distribute the evolving population across the search space. Mating then takes place within local regions of the population, promoting overall solution diversity and encouraging discovery of multiple solutions. Experiments using synthetic data sets have demonstrated the algorithm's capacity to find position frequency matrix models of known regulatory motifs in relatively long promoter sequences. These experiments have also shown the algorithm's ability to maintain diversity during search and discover multiple motifs within a single population. The utility of the algorithm for discovering motifs in real biological data is demonstrated by its ability to find meaningful motifs within muscle-specific regulatory sequences
An optimized TOPS+ comparison method for enhanced TOPS models
This article has been made available through the Brunel Open Access Publishing Fund.Background
Although methods based on highly abstract descriptions of protein structures, such as VAST and TOPS, can perform very fast protein structure comparison, the results can lack a high degree of biological significance. Previously we have discussed the basic mechanisms of our novel method for structure comparison based on our TOPS+ model (Topological descriptions of Protein Structures Enhanced with Ligand Information). In this paper we show how these results can be significantly improved using parameter optimization, and we call the resulting optimised TOPS+ method as advanced TOPS+ comparison method i.e. advTOPS+.
Results
We have developed a TOPS+ string model as an improvement to the TOPS [1-3] graph model by considering loops as secondary structure elements (SSEs) in addition to helices and strands, representing ligands as first class objects, and describing interactions between SSEs, and SSEs and ligands, by incoming and outgoing arcs, annotating SSEs with the interaction direction and type. Benchmarking results of an all-against-all pairwise comparison using a large dataset of 2,620 non-redundant structures from the PDB40 dataset [4] demonstrate the biological significance, in terms of SCOP classification at the superfamily level, of our TOPS+ comparison method.
Conclusions
Our advanced TOPS+ comparison shows better performance on the PDB40 dataset [4] compared to our basic TOPS+ method, giving 90 percent accuracy for SCOP alpha+beta; a 6 percent increase in accuracy compared to the TOPS and basic TOPS+ methods. It also outperforms the TOPS, basic TOPS+ and SSAP comparison methods on the Chew-Kedem dataset [5], achieving 98 percent accuracy. Software Availability: The TOPS+ comparison server is available at http://balabio.dcs.gla.ac.uk/mallika/WebTOPS/.This article is available through the Brunel Open Access Publishing Fun
Application of clustering analysis and sequence analysis on the performance analysis of parallel applications
High Performance Computing and Supercomputing is the high end area of the computing science that studies and develops the most powerful computers available. Current supercomputers are extremely complex so are the applications that run on them. To take advantage
of the huge amount of computing power available it is strictly necessary to maximize the knowledge we have about how these applications behave and perform. This is the mission of the (parallel) performance analysis.
In general, performance analysis toolkits oUer a very simplistic manipulations of the performance data. First order statistics such as average or standard deviation are used to summarize the values of a given performance metric, hiding in some cases interesting facts available on the raw performance data. For this reason, we require the Performance Analytics, i.e. the application of Data Analytics techniques in the performance analysis area. This thesis contributes with two new techniques to the Performance Analytics Veld.
First contribution is the application of the cluster analysis to detect the parallel application computation structure. Cluster analysis is the unsupervised classiVcation of patterns (observations, data items or feature vectors) into groups (clusters). In this thesis we use the
cluster analysis to group the CPU burst of a parallel application, the regions on each process in-between communication calls or calls to the parallel runtime. The resulting clusters obtained are the diUerent computational trends or phases that appear in the application. These clusters are useful to understand the behaviour of computation part of the application and focus the analyses to those that present performance issues. We demonstrate that our approach requires diUerent clustering algorithms previously used in the area.
Second contribution of the thesis is the application of multiple sequence alignment algorithms to evaluate the computation structure detected. Multiple sequence alignment (MSA) is technique commonly used in bioinformatics to determine the similarities across two or more biological sequences: DNA or roteins. The Cluster Sequence Score we introduce applies a Multiple Sequence Alignment (MSA) algorithm to evaluate the SPMDiness of an application, i.e. how well its computation structure represents the Single Program Multiple Data (SPMD) paradigm structure. We also use this score in the Aggregative Cluster Re-Vnement, a new clustering algorithm we designed, able to detect the SPMD phases of an application at Vne-grain, surpassing the cluster algorithms we used initially. We demonstrate the usefulness of these techniques with three practical uses. The Vrst one is an extrapolation methodology able to maximize the performance metrics that characterize the application phases detected using a single application execution. The second one is the use of the computation structure detected to speedup in a multi-level simulation infrastructure. Finally, we analyse four production-class applications using the computation
characterization to study the impact of possible application improvements and portings of the applications to diUerent hardware conVgurations.
In summary, this thesis proposes the use of cluster analysis and sequence analysis to automatically detect and characterize the diUerent computation trends of a parallel application. These techniques provide the developer / analyst an useful insight of the application performance
and ease the understanding of the application’s behaviour. The contributions of the thesis are not reduced to proposals and publications of the techniques themselves, but also practical uses to demonstrate their usefulness in the analysis task. In addition, the research carried out during these years has provided a production tool for analysing applications’ structure, part of BSC Tools suite
- …