119,113 research outputs found

    Multiple sequence alignment based on set covers

    Full text link
    We introduce a new heuristic for the multiple alignment of a set of sequences. The heuristic is based on a set cover of the residue alphabet of the sequences, and also on the determination of a significant set of blocks comprising subsequences of the sequences to be aligned. These blocks are obtained with the aid of a new data structure, called a suffix-set tree, which is constructed from the input sequences with the guidance of the residue-alphabet set cover and generalizes the well-known suffix tree of the sequence set. We provide performance results on selected BAliBASE amino-acid sequences and compare them with those yielded by some prominent approaches

    A methodology for determining amino-acid substitution matrices from set covers

    Full text link
    We introduce a new methodology for the determination of amino-acid substitution matrices for use in the alignment of proteins. The new methodology is based on a pre-existing set cover on the set of residues and on the undirected graph that describes residue exchangeability given the set cover. For fixed functional forms indicating how to obtain edge weights from the set cover and, after that, substitution-matrix elements from weighted distances on the graph, the resulting substitution matrix can be checked for performance against some known set of reference alignments and for given gap costs. Finding the appropriate functional forms and gap costs can then be formulated as an optimization problem that seeks to maximize the performance of the substitution matrix on the reference alignment set. We give computational results on the BAliBASE suite using a genetic algorithm for optimization. Our results indicate that it is possible to obtain substitution matrices whose performance is either comparable to or surpasses that of several others, depending on the particular scenario under consideration

    Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time

    Full text link
    Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within approximation factor poly(log⁡n)\text{poly}(\log n). In this paper, we provide an algorithm with running time O~(n2−2/7)\tilde{O}(n^{2-2/7}) that approximates the edit distance within a constant factor

    Family-specific degenerate primer design: a tool to design consensus degenerated oligonucleotides

    Get PDF
    Designing degenerate PCR primers for templates of unknown nucleotide sequence may be a very difficult task. In this paper, we present a new method to design degenerate primers, implemented in family-specific degenerate primer design (FAS-DPD) computer software, for which the starting point is a multiple alignment of related amino acids or nucleotide sequences. To assess their efficiency, four different genome collections were used, covering a wide range of genomic lengths: Arenavirus ( nucleotides), Baculovirus ( to  bp), Lactobacillus sp. ( to  bp), and Pseudomonas sp. ( to  bp). In each case, FAS-DPD designed primers were tested computationally to measure specificity. Designed primers for Arenavirus and Baculovirus were tested experimentally. The method presented here is useful for designing degenerate primers on collections of related protein sequences, allowing detection of new family members.Fil: Iserte, Javier Alonso. Universidad Nacional de Quilmes. Departamento de Ciencia y TecnologĂ­a. Laboratorio de IngenierĂ­a GenĂ©tica y BiologĂ­a Molecular y Celular. Área de Virosis Emergentes y ZoonĂłtica; Argentina. Consejo Nacional de Investigaciones CientĂ­ficas y TĂ©cnicas; ArgentinaFil: Stephan, Betina InĂ©s. Universidad Nacional de Quilmes. Departamento de Ciencia y TecnologĂ­a. Laboratorio de IngenierĂ­a GenĂ©tica y BiologĂ­a Molecular y Celular. Área de Virosis Emergentes y ZoonĂłtica; Argentina. Consejo Nacional de Investigaciones CientĂ­ficas y TĂ©cnicas; ArgentinaFil: Goñi, Sandra Elizabeth. Universidad Nacional de Quilmes. Departamento de Ciencia y TecnologĂ­a. Laboratorio de IngenierĂ­a GenĂ©tica y BiologĂ­a Molecular y Celular. Área de Virosis Emergentes y ZoonĂłtica; Argentina. Consejo Nacional de Investigaciones CientĂ­ficas y TĂ©cnicas; ArgentinaFil: Borio, Cristina Silvia. Universidad Nacional de Quilmes. Departamento de Ciencia y TecnologĂ­a. Laboratorio de IngenierĂ­a GenĂ©tica y BiologĂ­a Molecular y Celular. Área de Virosis Emergentes y ZoonĂłtica; Argentina. Consejo Nacional de Investigaciones CientĂ­ficas y TĂ©cnicas; ArgentinaFil: Ghiringhelli, Pablo Daniel. Universidad Nacional de Quilmes. Departamento de Ciencia y TecnologĂ­a. Laboratorio de IngenierĂ­a GenĂ©tica y BiologĂ­a Molecular y Celular. Área Virus de Insectos; Argentina. Consejo Nacional de Investigaciones CientĂ­ficas y TĂ©cnicas; ArgentinaFil: Lozano, Mario Enrique. Consejo Nacional de Investigaciones CientĂ­ficas y TĂ©cnicas; Argentina. Universidad Nacional de Quilmes. Departamento de Ciencia y TecnologĂ­a. Laboratorio de IngenierĂ­a GenĂ©tica y BiologĂ­a Molecular y Celular. Área de Virosis Emergentes y ZoonĂłtica; Argentin

    Improved annotation of 3' untranslated regions and complex loci by combination of strand-specific direct RNA sequencing, RNA-seq and ESTs

    Get PDF
    The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct annotation is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3-prime untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences which locates 3-prime polyadenylation sites to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3-prime UTR re-annotation (including extension of one 3-prime UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publically available annotations in the context of their own experimental dataComment: 44 pages, 9 figure

    Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

    Get PDF
    Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies--based on simulation, consistency, protein structure, and phylogeny--and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should base their choice of benchmark depending on the context of application--with a keen awareness of the assumptions underlying each benchmarking strategy.Comment: Revie

    Towards trajectory anonymization: a generalization-based approach

    Get PDF
    Trajectory datasets are becoming popular due to the massive usage of GPS and locationbased services. In this paper, we address privacy issues regarding the identification of individuals in static trajectory datasets. We first adopt the notion of k-anonymity to trajectories and propose a novel generalization-based approach for anonymization of trajectories. We further show that releasing anonymized trajectories may still have some privacy leaks. Therefore we propose a randomization based reconstruction algorithm for releasing anonymized trajectory data and also present how the underlying techniques can be adapted to other anonymity standards. The experimental results on real and synthetic trajectory datasets show the effectiveness of the proposed techniques

    MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

    Full text link
    Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to HMM that can only model very short-range residue correlation, MRFs can model long-range residue interaction pattern and thus, encode information for the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection shall be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/.Comment: Accepted by both RECOMB 2014 and PLOS Computational Biolog
    • 

    corecore