
    A New Simulated Annealing Algorithm for the Multiple Sequence Alignment Problem: The approach of Polymers in a Random Media

    We propose a probabilistic algorithm to solve the Multiple Sequence Alignment problem. The algorithm is a Simulated Annealing (SA) that exploits the representation of the multiple alignment between D sequences as a directed polymer in D dimensions. Within this representation we can easily track the evolution of the alignment in configuration space through local moves of low computational cost. At variance with other probabilistic algorithms proposed to solve this problem, our approach allows for the creation and deletion of gaps without extra computational cost. The algorithm was tested by aligning proteins from the kinase family. When D=3 the results are consistent with those obtained using a complete algorithm. For D>3, where the complete algorithm fails, we show that our algorithm still converges to reasonable alignments. Moreover, we study the space of solutions obtained and show that, depending on the number of sequences aligned, the solutions are organized in different ways, suggesting a possible source of errors for progressive algorithms. Comment: 7 pages and 11 figures
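
    As a rough illustration of the simulated-annealing idea only (not the authors' directed-polymer representation), the toy sketch below anneals gap placements of a fixed-width alignment under a sum-of-pairs score. The scoring parameters, the move set (sliding a gap by one position) and the names sum_of_pairs, random_move and anneal_msa are assumptions for illustration.

```python
import math
import random

MATCH, MISMATCH, GAP = 2, -1, -2  # toy scoring parameters (assumed, not from the paper)

def sum_of_pairs(alignment):
    """Sum-of-pairs score over all column pairs of an equal-width alignment."""
    score = 0
    for col in zip(*alignment):
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                a, b = col[i], col[j]
                if a == '-' or b == '-':
                    score += GAP
                elif a == b:
                    score += MATCH
                else:
                    score += MISMATCH
    return score

def random_move(alignment):
    """Local move: slide one gap one position left or right within one row,
    which preserves the residue order of that sequence."""
    rows = [list(r) for r in alignment]
    i = random.randrange(len(rows))
    gaps = [p for p, c in enumerate(rows[i]) if c == '-']
    if not gaps:
        return alignment
    g = random.choice(gaps)
    t = g + random.choice((-1, 1))
    if 0 <= t < len(rows[i]):
        rows[i][g], rows[i][t] = rows[i][t], rows[i][g]
    return [''.join(r) for r in rows]

def anneal_msa(seqs, steps=20000, t0=2.0, cooling=0.9995):
    """Toy simulated annealing over gap placements; rows padded to equal width."""
    width = max(len(s) for s in seqs) + 5          # leave room for gaps
    state = [s + '-' * (width - len(s)) for s in seqs]
    best, best_score = state, sum_of_pairs(state)
    temp, score = t0, best_score
    for _ in range(steps):
        cand = random_move(state)
        cand_score = sum_of_pairs(cand)
        # Always accept improvements; accept worse states with Boltzmann probability.
        if cand_score >= score or random.random() < math.exp((cand_score - score) / temp):
            state, score = cand, cand_score
            if score > best_score:
                best, best_score = state, score
        temp *= cooling
    return best, best_score

if __name__ == "__main__":
    aln, s = anneal_msa(["HEAGAWGHEE", "PAWHEAE", "HEAWGHE"])
    print(s)
    print("\n".join(aln))
```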

    Comparison of Spectra in Unsequenced Species

    We introduce a new algorithm for the mass spectrometric identification of proteins. Experimental spectra obtained by tandem MS/MS are directly compared to theoretical spectra generated from proteins of evolutionarily closely related organisms. This work is motivated by the need for a method that allows the identification of proteins of unsequenced species against a database containing proteins of related organisms. The idea is that matching spectra of unknown peptides to very similar MS/MS spectra generated from this database of annotated proteins can lead to the annotation of unknown proteins. This process is similar to ortholog annotation in protein sequence databases. The difficulty with such an approach is that two similar peptides, even with just one modification (i.e. insertion, deletion or substitution of one or several amino acids) between them, usually generate very dissimilar spectra. In this paper, we present a new dynamic programming based algorithm, PacketSpectralAlignment. Our algorithm is tolerant to modifications and fully exploits two important properties that are usually not considered: the notion of inner symmetry, a relation linking pairs of spectrum peaks, and the notion of packets inside each spectrum, which keep related peaks together. Our algorithm, PacketSpectralAlignment, is then compared to SpectralAlignment [1] on a dataset of simulated spectra. Our tests show that PacketSpectralAlignment behaves better, both in terms of results and execution time.
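
    As background to the comparison of experimental and theoretical spectra, the sketch below generates the singly charged b- and y-ion peaks of a peptide from standard monoisotopic residue masses and counts shared peaks within a naive mass tolerance. It is not PacketSpectralAlignment; the helper names, tolerance and example peptides are assumptions, chosen only to show how a single substitution already perturbs many peaks.

```python
# Monoisotopic amino-acid residue masses in Da (standard values, rounded).
RESIDUE = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276, 'V': 99.06841,
    'T': 101.04768, 'C': 103.00919, 'L': 113.08406, 'I': 113.08406, 'N': 114.04293,
    'D': 115.02694, 'Q': 128.05858, 'K': 128.09496, 'E': 129.04259, 'M': 131.04049,
    'H': 137.05891, 'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
PROTON, WATER = 1.00728, 18.01056

def theoretical_spectrum(peptide):
    """Singly charged b- and y-ion m/z values of a peptide."""
    peaks = []
    prefix = 0.0
    for aa in peptide[:-1]:               # b1 .. b(n-1)
        prefix += RESIDUE[aa]
        peaks.append(prefix + PROTON)
    suffix = 0.0
    for aa in reversed(peptide[1:]):      # y1 .. y(n-1)
        suffix += RESIDUE[aa]
        peaks.append(suffix + WATER + PROTON)
    return sorted(peaks)

def shared_peaks(spec_a, spec_b, tol=0.5):
    """Naive similarity: peaks of spec_a matched in spec_b within `tol` Da."""
    return sum(any(abs(a - b) <= tol for b in spec_b) for a in spec_a)

if __name__ == "__main__":
    exp = theoretical_spectrum("PEPTIDE")
    # A single substitution (D -> N) already shifts a large share of the peaks.
    print(shared_peaks(exp, theoretical_spectrum("PEPTINE")), "of", len(exp), "peaks matched")
```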

    Expected length of the longest common subsequence for large alphabets

    We consider the length L of the longest common subsequence of two n-character words chosen uniformly and independently at random over a k-ary alphabet. Subadditivity arguments yield that the expected value of L, when normalized by n, converges to a constant C_k. We prove a conjecture of Sankoff and Mainville from the early 1980s claiming that C_k\sqrt{k} goes to 2 as k goes to infinity. Comment: 14 pages, 1 figure, LaTeX
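
    A quick Monte Carlo check of the C_k\sqrt{k} -> 2 trend can be done with the standard LCS dynamic program. The sketch below is not part of the proof: the word length and trial count are arbitrary choices, and the finite-n estimate of E[L]/n is biased somewhat below C_k.

```python
import random

def lcs_length(a, b):
    """Standard O(n*m) dynamic program for the longest common subsequence length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def estimate_ck(k, n=300, trials=10, seed=0):
    """Monte Carlo estimate of E[L]/n for two random n-character k-ary words."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.randrange(k) for _ in range(n)]
        b = [rng.randrange(k) for _ in range(n)]
        total += lcs_length(a, b)
    return total / (trials * n)

if __name__ == "__main__":
    for k in (2, 4, 16, 64):
        ck = estimate_ck(k)
        print(f"k={k:3d}  E[L]/n ~ {ck:.3f}  C_k*sqrt(k) ~ {ck * k ** 0.5:.3f}")
```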

    Safe and complete contig assembly via omnitigs

    Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. Since the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph G (e.g. a de Bruijn graph or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial-time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and that 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
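
    For context, the sketch below computes the popular unitigs (maximal non-branching paths) of a node-centric de Bruijn graph built from reads. This is only the baseline that the paper's omnitigs generalize, not the omnitig algorithm itself; it collapses parallel edges, ignores reverse complements, and skips components that are pure cycles.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Node-centric de Bruijn graph: edges between overlapping (k-1)-mers."""
    out_edges, in_deg, out_deg = defaultdict(set), defaultdict(int), defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            u, v = read[i:i + k - 1], read[i + 1:i + k]
            if v not in out_edges[u]:
                out_edges[u].add(v)
                out_deg[u] += 1
                in_deg[v] += 1
    return out_edges, in_deg, out_deg

def unitigs(reads, k):
    """Maximal non-branching paths (unitigs), spelled as strings."""
    out_edges, in_deg, out_deg = de_bruijn(reads, k)
    nodes = set(out_edges) | {v for vs in out_edges.values() for v in vs}
    internal = lambda v: in_deg[v] == 1 and out_deg[v] == 1
    tigs = []
    for u in nodes:
        if internal(u):
            continue                      # unitigs start at branching/terminal nodes
        for v in out_edges[u]:
            path = [u, v]
            while internal(path[-1]):     # extend through non-branching nodes
                path.append(next(iter(out_edges[path[-1]])))
            tigs.append(path[0] + ''.join(p[-1] for p in path[1:]))
    return tigs

if __name__ == "__main__":
    print(unitigs(["ACGTACGAT", "GTACGATTT"], k=4))
```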

    Thermodynamics of protein folding: a random matrix formulation

    The process of protein folding from an unfolded state to a biologically active, folded conformation is governed by many parameters, e.g. the sequence of amino acids, intermolecular interactions, the solvent, temperature and chaperone molecules. Our study, based on random matrix modeling of the interactions, shows however that the evolution of the statistical measures, e.g. Gibbs free energy, heat capacity and entropy, is single-parametric. This information can explain the selection of specific folding pathways from an infinite number of possible ways, as well as other folding characteristics observed in computer simulation studies. Comment: 21 pages, no figures

    Parking functions, labeled trees and DCJ sorting scenarios

    In genome rearrangement theory, one of the elusive questions raised in recent years is the enumeration of rearrangement scenarios between two genomes. This problem is related to the uniform generation of rearrangement scenarios and to the derivation of tests of statistical significance of the properties of these scenarios. Here we give an exact formula for the number of double-cut-and-join (DCJ) rearrangement scenarios of co-tailed genomes. We also construct effective bijections between the set of scenarios that sort a cycle and well-studied combinatorial objects such as parking functions and labeled trees. Comment: 12 pages, 3 figures
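
    As a quick illustration of the combinatorial objects mentioned, the sketch below enumerates parking functions by brute force and checks the classical count (n+1)^(n-1), which also counts labeled trees on n+1 vertices (Cayley's formula). The bijection to DCJ sorting scenarios is the paper's contribution and is not reproduced here.

```python
from itertools import product

def is_parking_function(seq):
    """seq is a parking function iff its sorted values satisfy a_i <= i (1-indexed)."""
    return all(a <= i for i, a in enumerate(sorted(seq), 1))

def count_parking_functions(n):
    """Brute-force count over all n^n sequences with entries in 1..n."""
    return sum(is_parking_function(s) for s in product(range(1, n + 1), repeat=n))

if __name__ == "__main__":
    for n in range(1, 7):
        assert count_parking_functions(n) == (n + 1) ** (n - 1)
        print(n, (n + 1) ** (n - 1))      # 1, 3, 16, 125, 1296, 16807
```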

    Limited Lifespan of Fragile Regions in Mammalian Evolution

    An important question in genome evolution is whether there exist fragile regions (rearrangement hotspots) where chromosomal rearrangements happen over and over again. Although nearly all recent studies supported the existence of fragile regions in mammalian genomes, the most comprehensive phylogenomic study of mammals (Ma et al. (2006) Genome Research 16, 1557-1565) raised some doubts about their existence. We demonstrate that fragile regions are subject to a "birth and death" process, implying that fragility has a limited evolutionary lifespan. This finding implies that fragile regions migrate to different locations in different mammals, explaining why only a few chromosomal breakpoints are shared between different lineages. The birth and death of fragile regions reinforces the hypothesis that rearrangements are promoted by matching segmental duplications and suggests putative locations of the currently active fragile regions in the human genome.

    Applying a User-centred Approach to Interactive Visualization Design

    Analysing users in their context of work and finding out how and why they use different information resources is essential to provide interactive visualisation systems that match their goals and needs. Designers should actively involve the intended users throughout the whole process. This chapter presents a user-centred approach for the design of interactive visualisation systems. We describe three phases of the iterative visualisation design process: the early envisioning phase, the global specification phase, and the detailed specification phase. The whole design cycle is repeated until some criterion of success is reached. We discuss different techniques for the analysis of users, their tasks and domain. Subsequently, the design of prototypes and evaluation methods in visualisation practice are presented. Finally, we discuss the practical challenges in the design and evaluation of collaborative visualisation environments. Our own case studies and those of others are used throughout the chapter to illustrate various approaches.

    Group testing with Random Pools: Phase Transitions and Optimal Strategy

    The problem of Group Testing (GT) is to identify defective items out of a set of objects by means of pool queries of the form "Does the pool contain at least one defective item?". The aim is of course to perform detection with the fewest possible queries, a problem which has relevant practical applications in different fields including molecular biology and computer science. Here we study GT in the probabilistic setting, focusing on the regime of small defective probability and a large number of objects, p \to 0 and N \to \infty. We construct and analyze one-stage algorithms for which we establish the occurrence of a non-detection/detection phase transition resulting in a sharp threshold, \bar M, for the number of tests. By optimizing the pool design we construct algorithms whose detection threshold follows the optimal scaling \bar M \propto Np|\log p|. Then we consider two-stage algorithms and analyze their performance for different choices of the first-stage pools. In particular, via a proper random choice of the pools, we construct algorithms which attain the optimal value (previously determined in Ref. [16]) for the mean number of tests required for complete detection. We finally discuss the optimal pool design in the case of finite p.
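
    A minimal simulation of the one-stage setting is sketched below: random constant-size pools are tested and an item is declared non-defective as soon as it appears in a negative pool (a simple COMP-style decoder, not the paper's optimized design). Because this decoder needs a constant factor more tests than the optimal threshold, the transition to exact recovery here shows up at a few multiples of Np|log p|; all parameters (n, p, pool size, test budgets) are arbitrary choices for illustration.

```python
import math
import random

def one_stage_trial(n, p, m, pool_size, rng):
    """One-stage GT with m random pools; an item is declared non-defective iff it
    appears in at least one negative pool, so this decoder has no false negatives."""
    defective = {i for i in range(n) if rng.random() < p}
    candidate = set(range(n))                       # items still suspected defective
    for _ in range(m):
        pool = rng.sample(range(n), pool_size)
        if defective.isdisjoint(pool):              # a negative test clears its items
            candidate.difference_update(pool)
    return len(candidate - defective), candidate == defective

def sweep(n=1000, p=0.02, trials=30, seed=1):
    rng = random.Random(seed)
    scale = n * p * abs(math.log(p))                # the Np|log p| scale from the abstract
    pool_size = int(1 / p)                          # ~1/p items per pool (simple heuristic)
    for factor in (2, 4, 6, 8):
        m = int(factor * scale)
        fps, exact = 0, 0
        for _ in range(trials):
            fp, ok = one_stage_trial(n, p, m, pool_size, rng)
            fps += fp
            exact += ok
        print(f"M = {factor} * Np|log p| = {m:4d}: "
              f"mean false positives {fps / trials:6.2f}, exact recovery {exact}/{trials}")

if __name__ == "__main__":
    sweep()
```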

    Viral population estimation using pyrosequencing

    The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate-based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug-resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an EM algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies. Comment: 23 pages, 13 figures
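
    Only the final step of such a pipeline, inferring haplotype frequencies by EM once a set of explaining haplotypes is fixed, is easy to sketch. The version below assumes a uniform per-base error model and uses hypothetical toy reads and haplotypes; the error-correction and minimal-haplotype-set stages described in the abstract are not reproduced.

```python
def read_likelihood(read, hap, err=0.01):
    """P(read | haplotype) under an i.i.d. per-base error model; the read is aligned
    to the haplotype starting at read['start'], and 'N' positions are ignored."""
    p = 1.0
    for offset, base in enumerate(read["seq"]):
        if base == 'N':
            continue
        p *= (1 - err) if base == hap[read["start"] + offset] else err / 3
    return p

def em_frequencies(reads, haplotypes, iters=200):
    """EM for the mixture weights of known haplotypes given observed reads."""
    k = len(haplotypes)
    lik = [[read_likelihood(r, h) for h in haplotypes] for r in reads]   # precompute
    freqs = [1.0 / k] * k
    for _ in range(iters):
        counts = [0.0] * k
        for row in lik:                               # E-step: posterior origin of each read
            weights = [f * l for f, l in zip(freqs, row)]
            z = sum(weights) or 1.0
            for j, w in enumerate(weights):
                counts[j] += w / z
        freqs = [c / len(reads) for c in counts]      # M-step: renormalize expected counts
    return freqs

if __name__ == "__main__":
    haps = ["ACGTACGTAC", "ACGAACGTTC"]               # toy "population" of two haplotypes
    reads = [{"start": 0, "seq": "ACGT"}, {"start": 2, "seq": "GAAC"},
             {"start": 4, "seq": "ACGT"}, {"start": 6, "seq": "GTTC"},
             {"start": 0, "seq": "ACGA"}, {"start": 3, "seq": "AACG"}]
    print([round(f, 3) for f in em_frequencies(reads, haps)])
```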