82 research outputs found
Data compression for sequencing data
Post-Sanger sequencing methods produce tons of data, and there is a general agreement that the challenge to store and process them must be addressed with data compression. In this review we first answer the question "why compression" in a quantitative manner. Then we also answer the questions "what" and "how", by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we go back to the question "why compression" and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.
Efficient compression of biological sequences using a neural network
Background: The increasing production of genomic data has led to
an intensified need for models that can cope efficiently with the lossless
compression of biosequences. Important applications include long-term
storage and compression-based data analysis. In the literature, only a
few recent articles propose the use of neural networks for biosequence
compression. However, they fall short when compared with specific
DNA compression tools, such as GeCo2. This limitation is due to the
absence of models specifically designed for DNA sequences. In this
work, we combine the power of neural networks with specific DNA and
amino acids models. For this purpose, we created GeCo3 and AC2, two
new biosequence compressors. Both use a neural network for mixing
the opinions of multiple specific models.
Findings: We benchmark GeCo3 as a reference-free DNA compressor
in five datasets, including a balanced and comprehensive dataset
of DNA sequences, the Y-chromosome and human mitogenome, two
compilations of archaeal and virus genomes, four whole genomes, and
two collections of FASTQ data of a human virome and ancient DNA.
GeCo3 achieves a solid improvement in compression over the previous
version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively.
As a reference-based DNA compressor, we benchmark GeCo3 in four
datasets constituted by the pairwise compression of the chromosomes
of the genomes of several primates. GeCo3 improves the compression by
12.4%, 11.7%, 10.8%, and 10.1% over the state-of-the-art. The cost of
this compression improvement is some additional computational time
(1.7× to 3.0× slower than GeCo2). The RAM is constant, and the tool
scales efficiently, independently from the sequence size. Overall, these
values outperform the state-of-the-art. For AC2 the improvements and
costs over AC are similar, which allows the tool to also outperform the
state-of-the-art.
Conclusions: GeCo3 and AC2 are biosequence compressors with
a neural network mixing approach that provides additional gains over
top specific biocompressors. The proposed mixing method is portable,
requiring only the probabilities of the models as inputs, providing easy
adaptation to other data compressors or compression-based data analysis
tools. GeCo3 and AC2 are released under GPLv3 and are available
for free download at https://github.com/cobilab/geco3 and
https://github.com/cobilab/ac2.
Master's degree in Computer and Telematics Engineering
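The abstract above states that the mixing method needs only the probabilities of the individual models as inputs. A minimal sketch of that idea follows, assuming a single-layer softmax network trained online with a cross-entropy gradient step. The class name `SoftmaxMixer`, the learning rate, and the two toy models are hypothetical choices for illustration; this is not the GeCo3 architecture or its code.

```python
import math

ALPHABET = "ACGT"


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


class SoftmaxMixer:
    """Hypothetical single-layer network mixing per-model symbol probabilities."""

    def __init__(self, n_models, n_symbols=4, lr=0.02):
        self.n_in = n_models * n_symbols
        self.n_out = n_symbols
        self.lr = lr
        # Start as a normalised geometric mean of the models: output s
        # averages the log-probabilities that each model assigns to s.
        self.w = [[0.0] * self.n_in for _ in range(n_symbols)]
        for s in range(n_symbols):
            for m in range(n_models):
                self.w[s][m * n_symbols + s] = 1.0 / n_models
        self.b = [0.0] * n_symbols

    def _features(self, model_probs):
        # Inputs are the log-probabilities of every model, flattened model by model.
        return [math.log(max(p, 1e-12)) for probs in model_probs for p in probs]

    def mix(self, model_probs):
        """Return the mixed probability distribution over the alphabet."""
        x = self._features(model_probs)
        z = [sum(wi * xi for wi, xi in zip(row, x)) + b
             for row, b in zip(self.w, self.b)]
        return softmax(z)

    def update(self, model_probs, symbol_index):
        """One gradient step on the observed symbol; returns its cost in bits."""
        x = self._features(model_probs)
        p = self.mix(model_probs)
        for s in range(self.n_out):
            grad = p[s] - (1.0 if s == symbol_index else 0.0)
            for i in range(self.n_in):
                self.w[s][i] -= self.lr * grad * x[i]
            self.b[s] -= self.lr * grad
        # Cost under the distribution used *before* the weights were updated.
        return -math.log2(p[symbol_index])


if __name__ == "__main__":
    # Toy models (assumptions): a uniform model and a "repeat previous base" model.
    sequence = "AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT"
    mixer = SoftmaxMixer(n_models=2)
    total_bits = 0.0
    prev = None
    for ch in sequence:
        uniform = [0.25] * 4
        if prev is None:
            repeat = [0.25] * 4
        else:
            repeat = [0.91 if s == prev else 0.03 for s in ALPHABET]
        total_bits += mixer.update([uniform, repeat], ALPHABET.index(ch))
        prev = ch
    print(f"{total_bits / len(sequence):.3f} bits per base")
```

In a real compressor the mixed distribution would drive an arithmetic coder; here the accumulated `-log2` cost stands in for the coded length.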
Pattern Discovery from Biosequences
In this thesis we have developed novel methods for analyzing biological data: the primary sequences of DNA and proteins, microarray-based gene expression data, and other functional genomics data. The main contribution is the development of the pattern discovery algorithm SPEXS, accompanied by several practical applications for analyzing real biological problems. To perform these biological studies, which integrate different types of biological data, we have developed a comprehensive web-based biological data analysis environment, Expression Profiler (http://ep.ebi.ac.uk/).
Structural Constraints Identified with Covariation Analysis in Ribosomal RNA
Covariation analysis is used to identify those positions with similar patterns of sequence variation in an alignment of RNA sequences. These constraints on the evolution of two positions are usually associated with a base pair in a helix. While mutual information (MI) has been used to accurately predict an RNA secondary structure and a few of its tertiary interactions, early studies revealed that phylogenetic event counting methods are more sensitive and provide extra confidence in the prediction of base pairs. We developed a novel and powerful phylogenetic events counting method (PEC) for quantifying positional covariation with the Gutell lab's new RNA Comparative Analysis Database (rCAD). The PEC and MI-based methods each identify unique base pairs, and jointly identify many other base pairs. In total, both methods in combination with an N-best and helix-extension strategy identify the maximal number of base pairs. While covariation methods have effectively and accurately predicted RNA secondary structure, only a few tertiary structure base pairs have been identified. Analyses presented herein and at the Gutell lab's Comparative RNA Web (CRW) Site reveal that the majority of these latter base pairs do not covary with one another. However, covariation analysis does reveal a weaker although significant covariation between sets of nucleotides that are in proximity in the three-dimensional RNA structure. This reveals that covariation analysis identifies other types of structural constraints beyond the two nucleotides that form a base pair.
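The mutual information measure mentioned above can be illustrated for a pair of alignment columns. The sketch below computes plain MI in bits from observed frequencies, assuming an ungapped toy alignment; the function names and the example sequences are hypothetical, and this is not the PEC method or the rCAD implementation.

```python
import math
from collections import Counter


def column(alignment, i):
    """Extract column i from a list of equal-length aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(col_x, col_y):
    """MI in bits between two columns, using observed frequencies only."""
    n = len(col_x)
    fx = Counter(col_x)
    fy = Counter(col_y)
    fxy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), count in fxy.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((fx[a] / n) * (fy[b] / n)))
    return mi


if __name__ == "__main__":
    # Toy alignment (assumption): columns 0 and 2 always form a
    # Watson-Crick pair, while column 1 never varies.
    alignment = ["GAC", "AAU", "CAG", "UAA"]
    paired = mutual_information(column(alignment, 0), column(alignment, 2))
    invariant = mutual_information(column(alignment, 0), column(alignment, 1))
    print(paired, invariant)
```

Columns that change together score high, while an invariant column scores zero, which is why MI alone cannot detect conserved base pairs.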