82 research outputs found

    Data compression for sequencing data

    Post-Sanger sequencing methods produce tons of data, and there is general agreement that the challenge of storing and processing them must be addressed with data compression. In this review we first answer the question “why compression” in a quantitative manner. We then answer the questions “what” and “how” by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we return to the question “why compression” and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.

    Efficient compression of biological sequences using a neural network

    Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of biosequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for biosequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA and amino acid models. For this purpose, we created GeCo3 and AC2, two new biosequence compressors. Both use a neural network for mixing the opinions of multiple specific models. Findings: We benchmark GeCo3 as a reference-free DNA compressor in five datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, two compilations of archaeal and virus genomes, four whole genomes, and two collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. As a reference-based DNA compressor, we benchmark GeCo3 in four datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state-of-the-art. The cost of this compression improvement is some additional computational time (1.7× to 3.0× slower than GeCo2). The RAM is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state-of-the-art. For AC2 the improvements and costs over AC are similar, which allows the tool to also outperform the state-of-the-art. Conclusions: GeCo3 and AC2 are biosequence compressors with a neural network mixing approach that provides additional gains over top specific biocompressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 and AC2 are released under GPLv3 and are available for free download at https://github.com/cobilab/geco3 and https://github.com/cobilab/ac2. Master's degree in Computer and Telematics Engineering

    Pattern Discovery from Biosequences

    In this thesis we have developed novel methods for analyzing biological data: the primary sequences of DNA and proteins, microarray-based gene expression data, and other functional genomics data. The main contribution is the development of the pattern discovery algorithm SPEXS, accompanied by several practical applications for analyzing real biological problems. To perform these biological studies, which integrate different types of biological data, we have developed a comprehensive web-based biological data analysis environment, Expression Profiler (http://ep.ebi.ac.uk/).

    University of Helsinki Department of Computer Science Annual Report 1998

    Annotated Bibliography for the DEWPOINT project

    Structural Constraints Identified with Covariation Analysis in Ribosomal RNA

    Covariation analysis is used to identify those positions with similar patterns of sequence variation in an alignment of RNA sequences. These constraints on the evolution of two positions are usually associated with a base pair in a helix. While mutual information (MI) has been used to accurately predict an RNA secondary structure and a few of its tertiary interactions, early studies revealed that phylogenetic event counting methods are more sensitive and provide extra confidence in the prediction of base pairs. We developed a novel and powerful phylogenetic events counting method (PEC) for quantifying positional covariation with the Gutell lab’s new RNA Comparative Analysis Database (rCAD). The PEC and MI-based methods each identify unique base pairs, and jointly identify many other base pairs. In total, both methods in combination with an N-best and helix-extension strategy identify the maximal number of base pairs. While covariation methods have effectively and accurately predicted RNA secondary structure, only a few tertiary structure base pairs have been identified. Analyses presented herein and at the Gutell lab’s Comparative RNA Web (CRW) Site reveal that the majority of these latter base pairs do not covary with one another. However, covariation analysis does reveal a weaker although significant covariation between sets of nucleotides that are in proximity in the three-dimensional RNA structure. This reveals that covariation analysis identifies other types of structural constraints beyond the two nucleotides that form a base pair.
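
    As a rough illustration of the mutual information (MI) measure mentioned in this abstract, the sketch below computes the textbook MI, in bits, between two columns of a small alignment. It is not the PEC method and not the rCAD implementation; the function names `column` and `mutual_information` and the four-sequence toy alignment are assumptions made up for this example.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Return the characters found at position i of every aligned sequence."""
    return [seq[i] for seq in alignment]


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two equal-length alignment columns."""
    n = len(col_x)
    fx = Counter(col_x)                 # single-column counts for x
    fy = Counter(col_y)                 # single-column counts for y
    fxy = Counter(zip(col_x, col_y))    # joint counts of (x, y) pairs
    mi = 0.0
    for (a, b), count in fxy.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((fx[a] / n) * (fy[b] / n)))
    return mi


# Hypothetical toy alignment: positions 0 and 3 change together
# (G-C, C-G, A-U, U-A), while positions 1 and 2 are invariant.
alignment = ["GAAC", "CAAG", "AAAU", "UAAA"]

mi_paired = mutual_information(column(alignment, 0), column(alignment, 3))
mi_invariant = mutual_information(column(alignment, 0), column(alignment, 1))
```

    In this toy case the two covarying positions score well above the pair that includes an invariant column, which scores zero. This also hints at why plain MI cannot see a pairing between two fully conserved positions.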