
    Evolutionary distances in the twilight zone -- a rational kernel approach

    Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets. Comment: to appear in PLoS ONE

    AlignStat: a web-tool and R package for statistical comparison of alternative multiple sequence alignments

    Background: Alternative sequence alignment algorithms yield different results. It is therefore useful to quantify the similarities and differences between alternative alignments of the same sequences. These measurements can identify regions of consensus that are likely to be most informative in downstream analysis. They can also highlight systematic differences between alignments that relate to differences in the alignment algorithms themselves. Results: Here we present a simple method for aligning two alternative multiple sequence alignments to one another and assessing their similarity. Differences are categorised into merges, splits or shifts in one alignment relative to the other. A set of graphical visualisations allows for intuitive interpretation of the data. Conclusions: AlignStat enables easy one-off online use of MSA similarity comparisons, or their integration into R pipelines. The web-tool is available at AlignStat.Science.LaTrobe.edu.au. The R package, readme and example data are available on CRAN and GitHub.com/TS404/AlignStat

    Multiple Biological Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

    Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences' structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-Complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structure scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm for use in our three new multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in progressive fashion. The use of a dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequent stages. The other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biologically meaningful sequence alignments. To improve the speed of the multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm

    Estimating evolutionary dynamics of cleavage site peptides among H5HA avian influenza employing mathematical information theory approaches

    Estimating evolutionary conservation of cleavage site peptides among the HA proteins of all strains facilitates vaccine development against pandemic influenza. Conserved epitopes may be useful for diagnosis of animals infected with the influenza virus, and for preventing their spread in other regions [1]. In the preliminary stage of this study, in silico analysis of hemagglutinin was applied to predict potential cleavage sites of each strain using SigCleave [2] and the SignalP 3.0 server [3]. The second stage of the study focused on analyzing the structure of connecting peptides of hemagglutinin cleavage sites based on the available experimental data. Our results reveal a higher frequency of basic amino acids, essential for processing by the cellular protease, among pathogenic strains compared with non/low pathogenic strains. In addition, two complementary methods for identifying conserved amino acids were applied: a statistical entropy based method, possibly the most sensitive tool to estimate the diversity of peptides [5], and relative entropy estimation. Analysis with both methods demonstrates that the connecting peptides of the HA cleavage site of AIV in the United States were highly conserved over long periods of time. Entropy values help to select those sequences that have the highest potential for mutation in a broad spectrum of avian populations. Position 340 among our group of strains, with an entropy value of 0.877928, has the highest information value (in bits), where highly conserved positions are those with
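The two conservation measures named in this abstract, statistical (Shannon) entropy and relative entropy, can be illustrated per alignment column with a minimal sketch. It is not the authors' pipeline: the toy peptides, the uniform background over 20 amino acids, and the function names `column_entropy` and `relative_entropy` are assumptions made only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a uniform background distribution; real analyses would
# normally use observed background frequencies instead.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def relative_entropy(column, background=UNIFORM_BACKGROUND):
    """Kullback-Leibler divergence (bits) of a column from the background."""
    counts = Counter(column)
    total = len(column)
    return sum(
        (n / total) * math.log2((n / total) / background[aa])
        for aa, n in counts.items()
    )


# Hypothetical, already-aligned toy peptides of equal length.
peptides = ["RRKKR", "RRKKR", "RKKKR", "RRTKR"]
columns = list(zip(*peptides))  # transpose: one tuple per alignment position

entropies = [column_entropy(col) for col in columns]
divergences = [relative_entropy(col) for col in columns]
```

In this sketch a fully conserved column has zero Shannon entropy, while its relative entropy against the uniform background is at its maximum, which is why the two measures are often read together.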

    Towards a deeper understanding of protein sequence evolution

    Most bioinformatic analyses start by building sequence alignments by means of scoring matrices. An implicit approximation on which many scoring matrices are built is that protein sequence evolution is considered a sequence of Point Accepted Mutations (PAM) (Dayhoff et al., 1978), in which each substitution happens independently of the history of the sequence, namely with a probability that depends only on the initial and final amino acids. But different protein sites evolve at different rates (Echave et al., 2016) and this feature, though included in many phylogenetic reconstruction algorithms, is generally neglected when building or using substitution matrices. Moreover, substitutions at different protein sites are known to be entangled by coevolution (de Juan et al., 2013). This thesis is devoted to the analysis of the consequences of neglecting these effects and to the development of models of protein sequence evolution capable of incorporating them. We introduce a simple procedure that allows including the among-site rate variability in PAM-like scoring matrices through a mean-field-like framework, and we show that rate variability leads to nontrivial evolutions when considering whole protein sequences. We also propose a procedure for deriving a substitution rate matrix from Single Nucleotide Polymorphisms (SNPs): we first test the statistical compatibility of frequent genetic variants within a species and substitutions accumulated between species; moreover we show that the matrix built from SNPs faithfully describes substitution rates for short evolutionary times, if rate variability is taken into account. Finally, we present a simple model, inspired by coevolution, capable of predicting at the same time the along-chain correlation of substitutions and the time variability of substitution rates. This model is based on the idea that a mutation at a site enhances the probability of fixing mutations in the other protein sites in its spatial proximity, but only for a certain amount of time

    Applications of Evolutionary Bioinformatics in Basic and Biomedical Research

    With the revolutionary progress in sequencing technologies, computational biology emerged as a game-changing field which is applied in understanding molecular events of life for not only complementary but also exploratory purposes. Bioinformatics resources and tools significantly help in data generation, organization and analysis. However, there is still a need for developing new approaches built based on a biologist’s point of view. In protein bioinformatics, there are several fundamental problems such as (i) determining protein function; (ii) identifying protein-protein interactions; (iii) predicting the effect of amino acid variants. Here, I present three chapters addressing these problems from an evolutionary perspective. Firstly, I describe a novel search pipeline for protein domain identification. The algorithm chain provides sensitive domain assignments with the highest possible specificity. Secondly, I present a tool enabling large-scale visualization of presences and absences of proteins in hierarchically clustered genomes. This tool visualizes multi-layer information of any kind of genome-linked data with a special focus on domain architectures, enabling identification of coevolving domains/proteins, which can eventually help in identifying functionally interacting proteins. And finally, I propose an approach for distinguishing between benign and damaging missense mutations in a human disease by establishing the precise evolutionary history of the associated gene. This part introduces new criteria on how to determine functional orthologs via phylogenetic analysis. All three parts use comparative genomics and/or sequence analyses. Taken together, this study addresses important problems in protein bioinformatics and as a whole it can be utilized to describe proteins by their domains, coevolving partners and functionally important residues

    Reconstruction of Kauffman networks applying trees

    According to Kauffman’s theory [S. Kauffman, The Origins of Order, Self-Organization and Selection in Evolution, Oxford University Press, New York, 1993], enzymes in living organisms form a dynamic network, which governs their activity. For each enzyme the network contains a collection of enzymes affecting the enzyme and a Boolean function prescribing the next activity of the enzyme as a function of the present activity of the affecting enzymes. Kauffman’s original pure random structure of the connections was criticized by Barabasi and Albert [A.-L. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509–512]. Their model was unified with Kauffman’s network by Aldana and Cluzel [M. Aldana, P. Cluzel, A natural class of robust networks, Proc. Natl. Acad. Sci. USA 100 (2003) 8710–8714]. Kauffman postulated that the dynamic character of the network determines the fitness of the organism. If the network is either convergent or chaotic, the chance of survival is lessened. If, however, the network is stable and critical, the organism will proliferate. Kauffman originally proposed a special type of Boolean function to promote stability, with a property he called canalyzing. This property was extended by Shmulevich et al. [I. Shmulevich, H. Lähdesmäki, E.R. Dougherty, J. Astola, W. Zhang, The role of certain Post classes in Boolean network models of genetic networks, Proc. Natl. Acad. Sci. USA 100 (2003) 10734–10739] using Post classes. Following their ideas, we propose decision tree functions for enzymatic interactions. The model is fitted to microarray data of Cogburn et al. [L.A. Cogburn, W. Wang, W. Carre, L. Rejtő, T.E. Porter, S.E. Aggrey, J. Simon, System-wide chicken DNA microarrays, gene expression profiling, and discovery of functional genes, Poult. Sci. Assoc. 82 (2003) 939–951; L.A. Cogburn, X. Wang, W. Carre, L. Rejtő, S.E. Aggrey, M.J. Duclos, J. Simon, T.E. Porter, Functional genomics in chickens: development of integrated-systems microarrays for transcriptional profiling and discovery of regulatory pathways, Comp. Funct. Genom. 5 (2004) 253–261]. In microarray measurements the activity of clones is measured. The problem here is the reconstruction of the structure of enzymatic interactions of the living organism using microarray data. The task resembles summing up the whole story of a film from unordered and perhaps incomplete collections of its pieces. Two basic ingredients will be used in tackling the problem. In our earlier works [L. Rejtő, G. Tusnády, Evolution of random Boolean NK-models in Tierra environment, in: I. Berkes, E. Csaki, M. Csörgő (Eds.), Limit Theorems in Probability and Statistics, Budapest, vol. II, 2002, pp. 499–526] we used an evolutionary strategy called Tierra, which was proposed by Ray [T.S. Ray, Evolution, complexity, entropy and artificial reality, Physica D 75 (1994) 239–263] for investigating complex systems. Here we apply this method together with the tree structure of clones found in our earlier statistical analysis of microarray measurements [L. Rejtő, G. Tusnády, Clustering methods in microarrays, Period. Math. Hungar. 50 (2005) 199–221]
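The network described in this abstract, where each node has a set of affecting nodes and a Boolean function giving its next activity, can be sketched as a synchronous update rule. The sketch below also includes a brute-force check of the canalyzing property, taken here in its usual sense that fixing one input at one value forces the output. The three-node wiring, the chosen functions, and the names `step` and `is_canalyzing` are illustrative assumptions, not the paper's decision tree model.

```python
from itertools import product


def step(state, inputs, functions):
    """Synchronously update every node from the current state."""
    return tuple(
        functions[i](*(state[j] for j in inputs[i]))
        for i in range(len(state))
    )


def is_canalyzing(f, k):
    """True if some input, fixed at some value, determines the output of f.

    f takes k Boolean (0/1) arguments; all 2**k input combinations are checked.
    """
    for idx in range(k):
        for val in (0, 1):
            outputs = {
                f(*args)
                for args in product((0, 1), repeat=k)
                if args[idx] == val
            }
            if len(outputs) == 1:
                return True
    return False


# Hypothetical wiring: node i reads the nodes listed in inputs[i].
inputs = {0: (1, 2), 1: (0, 2), 2: (0, 1)}
functions = {
    0: lambda a, b: a & b,  # AND: canalyzing (a = 0 forces output 0)
    1: lambda a, b: a | b,  # OR: canalyzing (a = 1 forces output 1)
    2: lambda a, b: a ^ b,  # XOR: not canalyzing
}

state = (1, 0, 1)
next_state = step(state, inputs, functions)
```

Iterating `step` from every start state of such a small network is enough to enumerate its attractors, which is how the ordered and chaotic regimes mentioned above are usually explored on toy examples.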

    Integrated multiple sequence alignment

    Sammeth M. Integrated multiple sequence alignment. Bielefeld (Germany): Bielefeld University; 2005. The thesis presents enhancements for automated and manual multiple sequence alignment: existing alignment algorithms are made more easily accessible and new algorithms are designed for difficult cases. Firstly, we introduce the QAlign framework, a graphical user interface for multiple sequence alignment. It comprises several state-of-the-art algorithms and supports their parameters by convenient dialogs. An alignment viewer with guided editing functionality can also highlight or print regions of the alignment. Phylogenetic features are also provided, e.g., distance-based tree reconstruction methods, corrections for multiple substitutions and a tree viewer. The modular concept and the platform-independent implementation guarantee easy extensibility. Further, we develop a constrained version of the divide-and-conquer alignment such that it can be restricted by anchors found earlier with local alignments. It can be shown that this method shares attributes of both local and global aligners, in the quality of results as well as in the computation time. We further modify the local alignment step to work on bipartite (or even multipartite) sets for sequences where repeats overshadow valuable sequence information. In the end a technique is established that can accurately align sequences containing possibly repeated motifs. Finally, another algorithm is presented that allows comparing tandem repeat sequences by aligning them with respect to their possible repeat histories. We describe an evolutionary model including tandem duplications and excisions, and give an exact algorithm to compare two sequences under this model

    Automated Genome-Wide Protein Domain Exploration

    Exploiting the exponentially growing genomics and proteomics data requires high quality, automated analysis. Protein domain modeling is a key area of molecular biology as it unravels the mysteries of evolution, protein structures, and protein functions. A plethora of sequences exist in protein databases with incomplete domain knowledge. Hence this research explores automated bioinformatics tools for faster protein domain analysis. Automated tool chains described in this dissertation generate new protein domain models thus enabling more effective genome-wide protein domain analysis. To validate the new tool chains, the Shewanella oneidensis and Escherichia coli genomes were processed, resulting in a new peptide domain database, detection of poor domain models, and identification of likely new domains. The automated tool chains will require months or years to model a small genome when executing on a single workstation. Therefore the dissertation investigates approaches with grid computing and parallel processing to significantly accelerate these bioinformatics tool chains